Preface



This volume contains the papers presented at the 12th Annual Conference on 
Algorithmic Learning Theory (ALT 2001), which was held in Washington DC, 
USA, during November 25-28, 2001. The main objective of the conference is to 
provide an inter-disciplinary forum for the discussion of theoretical foundations 
of machine learning, as well as their relevance to practical applications. The 
conference was co-located with the Fourth International Conference on Discovery 
Science (DS 2001). 

The volume includes 21 contributed papers. These papers were selected by 
the program committee from 42 submissions based on clarity, significance, ori- 
ginality, and relevance to theory and practice of machine learning. 

Additionally, the volume contains the invited talks of ALT 2001 presented 
by Dana Angluin of Yale University, USA, Paul R. Cohen of the University of 
Massachusetts at Amherst, USA, and the joint invited talk for ALT 2001 and DS 
2001 presented by Setsuo Arikawa of Kyushu University, Japan. Furthermore, 
this volume includes abstracts of the invited talks for DS 2001 presented by 
Lindley Darden and Ben Shneiderman both of the University of Maryland at 
College Park, USA. The complete versions of these papers are published in the 
DS 2001 proceedings (Lecture Notes in Artificial Intelligence Vol. 2226). 

ALT has been awarding the E Mark Gold Award for the most outstanding 
paper by a student author since 1999. This year the award was given to Ke 
Yang for his paper “On Learning Correlated Boolean Functions Using Statistical 
Queries.” 

This conference was the 12th in a series of annual conferences established in 
1990. Continuation of the ALT series is supervised by its steering committee con- 
sisting of Naoki Abe (IBM Thomas J. Watson Research Center, Yorktown, USA), 
Peter Bartlett (Australian National Univ.), Klaus P. Jantke (DFKI, Germany), 
Roni Khardon (Tufts University, USA), Phil Long (National Univ. of Singapore), 
Heikki Mannila (Nokia Research Center, USA), Akira Maruoka (Tohoku Univ., 
Sendai, Japan), Luc De Raedt (Albert-Ludwigs-Univ., Freiburg, Germany),
Takeshi Shinohara (Kyushu Inst. of Technology, Japan), Osamu Watanabe (Tokyo
Inst. of Technology, Japan), Arun Sharma (co-chair, Univ. of New South Wales,
Australia), and Thomas Zeugmann (chair, Med. Univ. of Lübeck, Germany).

We would like to thank all individuals and institutions who contributed to 
the success of the conference: the authors for submitting papers, the invited 
speakers for accepting our invitation and lending us their insight into recent 
developments in their research areas, the sponsors, and Springer-Verlag.

We are particularly grateful to the program committee for their work in 
reviewing papers and participating in on-line discussions. We are also grateful 
to the external referees whose reviews made a considerable contribution to this 
process. The program committee consisted of: 
Hiroki Arimura (Kyushu Univ., Japan)

Javed Aslam (Dartmouth College, USA) 

Peter Bartlett (Barnhill Technologies, Australia) 

Anselm Blumer (Tufts Univ., USA) 

Carlos Domingo (Synera Systems, Spain) 

Colin de la Higuera (Univ. St. Etienne, France) 

Steffen Lange (DFKI, Saarbrücken, Germany)

Phil Long (National Univ., Singapore) 

Eric Martin (UNSW, Australia) 

Atsuyoshi Nakamura (NEC Labs., Japan) 

Vijay Raghavan (Vanderbilt Univ., USA) 

Dan Roth (UIUC, USA) 

John Shawe-Taylor (Univ. of London, UK) 

Eiji Takimoto (Tohoku Univ., Japan) 

Christino Tamon (Clarkson Univ., USA) 

We are grateful to DS 2001 chairs Masahiko Sato (Kyoto Univ., Japan), Klaus 
P. Jantke (DFKI, Germany), and Ayumi Shinohara (Kyushu Univ., Japan) for 
their effort in coordinating with ALT 2001 and to Carl Smith of the University 
of Maryland at College Park for his work as the local arrangements chair for 
both conferences. 

Finally, we would like to thank the National Science Foundation, the Office 
of Naval Research, Agilent Technologies, Avaya Labs, and the University of 
Maryland for their generous financial support for the conference. 
Editors’ Introduction



Learning theory is an active research area with contributions from various fields 
including artificial intelligence, theoretical computer science, and statistics. The 
main thrust is an attempt to model learning phenomena in precise ways and 
study the mathematical properties of these scenarios. In this way one hopes to 
get a better understanding of the learning scenarios and what is possible or, as
we call it, learnable in each. Of course this goes with a study of algorithms that
achieve the required performance. Learning theory aims to define reasonable 
models of phenomena and find provably successful algorithms within each such 
model. To complete the picture we also seek impossibility results showing that 
certain things are not learnable within a particular model, irrespective of the 
particular learning algorithms or methods being employed. 

Unlike computability theory, where we have a single uniform notion of what 
is computable across multiple models of computation, not all learning models 
are equivalent. This should not come as a surprise, as the range of learning 
phenomena found in nature is wide, and the way in which learning systems are
applied in real world engineering problems is varied. Learning models cover this 
spectrum and vary along various aspects such as parameters characterizing the 
learning environment, the learning agent, and evaluation criteria. Some of these 
are: Is there a “teacher”? Is the teacher reliable or not? Is the learner passive or 
active? Is the learning function required to be computable or efficient (polyno- 
mial time)? Is learning done on-line, getting one example at a time, or in batch? 
Is the learner required to reproduce learned concepts exactly or is it merely 
required to produce a good approximation? Are the hypotheses output by the 
learner required to be in a certain representation class or are they free of syntac- 
tic constraints? Naturally such variations lead to dramatically different results. 
Learning theory has been extensively studying these aspects getting a deeper 
understanding of underlying phenomena and better algorithms for the various 
problems. 

Over the last few years learning theory has had a direct impact on practice in 
machine learning and its various application areas, with some algorithms driving 
leading edge systems. To name a few techniques having roots in learning theory 
and that have been applied to real world problems, there are support vector 
machines, boosting techniques, on-line learning algorithms, and active learning 
methods. Each of these techniques has proven extremely effective in the respec- 
tive application areas of relevance, such as pattern recognition, web and data 
mining, information extraction, and genomics. Such developments have recently 
inspired researchers in the field to investigate the relation between theory and 
practice, by combining theory with experimental validation or applications. 

Thus the picture is not uniform, and the field is making progress by exploring 
new models to capture new problems and phenomena, and studying algorithmic 
questions within each of these models. It is in this light that the papers in
the proceedings should be read. We have collected the papers in subgroups with
headings to highlight certain similar aspects either in topic or style but one
should keep in mind that often a paper could be classified in more than one 
category. 

The invited lecture for ALT 2001 and DS 2001 by Setsuo Arikawa describes 
the Discovery Science Project in Japan which aimed to develop new methods 
for knowledge discovery, to install network environments for knowledge discov- 
ery, and to establish Discovery Science as a new area of Computer Science and 
Artificial Intelligence. Though algorithmic learning theory and machine learn- 
ing have been integrated into this project, the researchers involved took a much 
broader perspective. Their work shed new light on many problems studied in 
learning theory and machine learning and led to a fruitful interaction between 
the different research groups participating in the project. 

In her invited lecture, Dana Angluin presents a comprehensive survey of the 
state of the art of learning via membership or equivalence queries or both, a field 
she has initiated (cf. [1]). Major emphasis is put on the number of queries needed
to learn a class of concepts. This number is related to various combinatorial 
characterizations of concept classes such as the teaching dimension, the exclu- 
sion dimension, the extended teaching dimension, the fingerprint dimension, the 
sample exclusion dimension, the well-known Vapnik-Chervonenkis dimension, 
the abstract identification dimension, and the general dimension. Each of these 
dimensions emphasises a different view on the learning problem and leads to a 
better understanding of what facilitates or complicates query learning. 

Robot Baby 2001 is presented by Paul R. Cohen et al. in his invited lec- 
ture. This paper provides strong evidence that meaningful representations are 
learnable by programs. Different notions of meaning are discussed, and special 
emphasis is put on a functional notion of meaning being appropriate for pro- 
grams to learn. Several interesting algorithms are provided and experimental 
results of their application are surveyed. This work raises an interesting chal- 
lenge of deriving theoretical results capturing aspects of practical relevance thus 
tightening the relation between theory and practice. 

Papers in the first section deal with complexity aspects of learning. Yang 
studies learnability within the statistical query (SQ) model introduced by Kearns 
PH. This model captures one natural way to obtain noise robustness when exam- 
ples are drawn identically and independently distributed (i.i.d.) from a fixed but 
unknown distribution, as in the well known PAC model m- In this case, if the 
algorithm does not rely directly on specific examples but rather on measurable 
statistical properties of examples then it is guaranteed to be robust to classifica- 
tion noise m- Yang studies learnability when the class of concepts in question 
includes highly correlated concepts, showing a lower bound in terms of the de- 
sired performance accuracy. This yields non-learnability results in the SQ model 
for a certain class of concepts which are contrasted with the PAC learnability of 
the same class. This result provides an interesting example for separating these 
learning models, since except for classes of parity functions practically all known 
PAC learnable classes have been shown to be SQ learnable as well. 
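
To make the idea of a statistical query concrete, the following is a minimal sketch in which a query is a 0/1-valued predicate over labeled examples and the answer is an estimate of its expected value. The sample-average simulation, the function name, and the toy data are assumptions made for illustration; the tolerance parameter of the SQ model is omitted.

```python
def empirical_statistical_query(chi, labeled_examples):
    """Estimate E[chi(x, label)] by a sample average.

    `chi` is a predicate returning 0 or 1. The learner only ever sees this
    aggregate value, never an individual example, which is the source of
    the robustness to classification noise discussed above.
    """
    values = [chi(x, label) for x, label in labeled_examples]
    return sum(values) / len(values)


# Hypothetical toy sample: Boolean vectors paired with labels.
sample = [((1, 0), 1), ((0, 1), 1), ((1, 1), 0), ((1, 0), 1)]


def first_bit_and_positive(x, label):
    """Query: how often is the first bit set on a positively labeled example?"""
    return int(x[0] == 1 and label == 1)


estimate = empirical_statistical_query(first_bit_and_positive, sample)
```
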
The name boosting captures a class of algorithms that use “weak” learners
(which provide good but imperfect hypotheses) to build more accurate hypotheses
[8]. The idea is to combine several runs of the weak learner in the process,
and techniques vary in the way this is done. It was recently observed that
the well known decision tree learning algorithms m can be viewed as boost- 
ing algorithms. Hatano improves these results by providing a tighter analysis of 
decision tree learning as boosting when the trees include splits with more than 
two branches. 

It is well known m that an algorithm which uses hypotheses of bounded 
capacity and finds a hypothesis consistent with training data learns the concept 
class in question in the PAC learning model. Similar results hold for algorithms 
that minimize the number of inconsistencies with training data. Various negative 
results for neural networks showing that the above is infeasible have been derived.
Sima continues this line by showing that for the sigmoid activation function even 
approximating the minimum training error is intractable. This is important as 
the popular backpropagation algorithm uses a gradient method to try to optimize 
exactly this function. 

Support vector machines (SVM) [5,6] use a neural model with a single neuron
and threshold activation function but combine two aspects to overcome learning 
difficulties with neural models. First, the input domain is implicitly enhanced 
to include a large number of possibly useful features. Second, the algorithm 
uses optimization techniques to find a threshold function of “maximum margin” 
providing robustness, since examples are not close to the decision surface of 
the hyperplane. The first aspect is done using the so called kernel functions 
which are used directly in the optimization procedure so that the enhanced 
features are not produced explicitly. Sadohara presents a kernel appropriate for 
learning over Boolean domains. Using this kernel is equivalent to learning a 
threshold element where features are all conjunctions of the basic feature set. 
This is highly relevant to the problem of learning DNF expressions; a question 
which has received considerable attention in learning theory. The paper also
presents experiments demonstrating that SVM using this kernel performs well
and compares favorably with other systems on Boolean data. 
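
As an illustration of how a kernel can stand in for an exponentially large set of conjunction features, the sketch below counts the nonempty monotone conjunctions (conjunctions of un-negated variables) that are true on two Boolean vectors. This simplified kernel is an assumption chosen for clarity and is not necessarily the kernel defined in the paper; all function names are hypothetical.

```python
from itertools import combinations


def monotone_conjunction_kernel(x, y):
    """Count nonempty monotone conjunctions true on both x and y.

    A conjunction of positive literals holds on both inputs exactly when
    every variable it mentions is 1 in both, so the count is 2**s - 1,
    where s is the number of positions set to 1 in both vectors.
    """
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return 2 ** shared - 1


def explicit_feature_map(x):
    """Enumerate the same features explicitly (exponential; for checking only)."""
    n = len(x)
    features = []
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            features.append(int(all(x[i] == 1 for i in subset)))
    return features


def explicit_inner_product(x, y):
    """Inner product in the explicit feature space, for comparison."""
    fx, fy = explicit_feature_map(x), explicit_feature_map(y)
    return sum(a * b for a, b in zip(fx, fy))
```

The kernel is computed in time linear in the number of variables, while the explicit feature map has 2^n - 1 coordinates; the two agree on every pair of inputs.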

Balcazar et al. propose a sampling based algorithm for solving the quadratic 
optimization problem involved in SVM (albeit not to the kernel construction). 
The intention is to improve complexity in cases where the number of examples 
is large. The basic idea is to use random sub-samples from the data set where 
the distribution is carefully controlled to iteratively improve the solution so that 
convergence is guaranteed in a small number of rounds. Algorithms are proposed 
both for the separable case and the “noisy” non-separable case, and practical 
complexity aspects of the methods are developed. 

The next section introduces models which try to capture new aspects of learning
phenomena and study the complexity of learning within these. Garg and
Roth introduce the notion of coherence constraint. The idea is that several con- 
cepts exist over the instance space and they are correlated. This correlation is 
captured by a coherency constraint which effectively implies that certain inputs 
(on which the constraint is violated) are not possible. The paper shows that
the existence of such constraints implies reduced learning complexity in several 
scenarios. Experiments illustrate that this indeed happens in practice as well. 

Kwek introduces a model where several concepts are learned simultaneously 
and where some concepts may depend on others. Since intermediate concepts are 
used as features in other concept descriptions the overall descriptional power is 
much higher. Nevertheless the paper shows that learnability of the representation 
class used for each concept implies learnability of all concepts in several models 
e.g. the PAC model. When the learner is active, that is when membership queries 
are allowed, this does not hold. 

Dooly et al. study the so-called multiple instance learning problem. In this 
problem, motivated by molecular drug activity applications, each example is 
described as a set of configurations one of which is responsible for the observed 
activity or label (cf. [zl)- The paper studies the case where the “label” is real 
valued giving a quantitative rather than binary measure of the activity. Negative 
results on learning from examples are presented and a new model of active 
learning, allowing membership or value queries is shown to allow learnability. 

On-line prediction games provide a model for evaluating learners or prediction
strategies under very general conditions [16]. In this framework an iterative
game is played where in each step the learner makes a prediction, observes the 
true value and suffers a loss based on these two values and a fixed loss function. 
The prediction complexity of a sequence gives a lower bound on the loss of any 
prediction algorithm for the sequence and can thus be seen as another way to 
characterize the inherent complexity of strings. The paper by Kalnishkan et al. 
derives results on the average complexity when the sequence is i.i.d. Bernoulli 
and relates this to the information complexity of the sequence. As a result it is 
shown that the Kolmogorov complexity does not coincide with the prediction 
complexity for the binary game. The paper by Vyugin and V’yugin studies the 
relation between Kolmogorov and predictive complexity. In particular, a gap is 
established which depends logarithmically on the length of the strings. 
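
The protocol of the prediction game can be sketched as a simple loop. The square loss, the strategy, and all names below are assumptions made for illustration; the papers consider general loss functions and do not prescribe this code.

```python
def square_loss(prediction, outcome):
    """One possible fixed loss function for the game."""
    return (prediction - outcome) ** 2


def play_prediction_game(strategy, outcomes, loss=square_loss):
    """Run the iterative game and return the learner's cumulative loss.

    `strategy` maps the list of past outcomes to a prediction in [0, 1].
    """
    history = []
    total_loss = 0.0
    for outcome in outcomes:
        prediction = strategy(history)           # the learner predicts first
        total_loss += loss(prediction, outcome)  # then suffers the loss
        history.append(outcome)                  # and observes the true value
    return total_loss


def running_mean_strategy(history):
    """Hypothetical strategy: predict the frequency of 1s seen so far."""
    if not history:
        return 0.5
    return sum(history) / len(history)


total = play_prediction_game(running_mean_strategy, [1, 1, 0])
```

The cumulative loss of any concrete strategy such as this one is an upper bound on the best achievable loss for the sequence, which is the quantity that predictive complexity characterizes.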

Research in inductive inference follows seminal work by Gold [H] who in- 
troduced the model of learning in the limit. Here finite complexity rather than 
polynomial complexity defines the notion of feasibility. For concept learning, 
Gold defined two learning models, i.e., Text where the learner sees only positive 
examples of the underlying concept and Informant where both positive and neg- 
ative examples are seen. Jain and Stephan study several intermediate models 
based on the strategy of information presentation where the learner switches 
from asking to see positive to negative examples or vice versa. A hierarchy be- 
tween the notions is established, and a more refined hierarchy is shown in case 
that the number of times the learner switches example types is limited. 

In a second paper Jain and Stephan study the problem of learning to sep- 
arate pairs of disjoint sets which do not necessarily cover the instance space. 
Several restrictions on the learner are studied within this framework, for exam- 
ple: conservative learners who only abandon hypotheses which were contradicted 
by data, and set-driven learners whose hypotheses do exclusively depend on the 
range of the input. The effect of these restrictions and their interaction is extensively
studied. The two notions mentioned here are not comparable if the learner
converges on all data sequences. 

Jain et al. study a related model of learning unions of languages. Two main
variants are discussed where the learner either needs to identify the union or is 
required to identify the element languages composed in the union. Several results 
relating the strength of the models are derived establishing hierarchies when the 
number of languages in the union is increased, as well as identifying complete 
problems under appropriate reductions for these classes. 

Zilles studies a notion of meta-learning where a single learning algorithm can 
learn several concept classes by being given an index of the class as a parameter. 
Two scenarios are discussed where the learner either uses a single representation 
for hypotheses in all classes or can change the representation with the class. The 
paper studies the effect of restricting the learner, for example to be conservative 
as described above, on the learnable classes. Various separation results are given 
using finite concept classes to separate the models. 

The next group of papers also deals with inductive inference. The common 
aspect studied by all these papers is the notion of refutation originally introduced
in [13], which was in part motivated by the design of automatic discovery
systems, in which the choice of the hypothesis class is a critical parameter. In
their paper [13], the following scenario is considered. The learner is given a
hypothesis space of uniformly recursive concepts in advance. Whenever the target
concept can be correctly described by a member of this hypothesis space, then 
the learner has to identify it in the limit. If, however, the learner is fed data of a 
target concept that has no correct description within the hypothesis space given, 
then the learner has to refute the whole hypothesis space after a finite amount 
of time by outputting a special refutation symbol and stopping the learning pro- 
cess. Thus, within the model of learning refutably, the learner either identifies a 
target concept or itself indicates its inability to do so. 

Mukouchi and Sato extend the original approach by relaxing the correctness 
criterion and by allowing noisy examples. In their new model, the learner must
succeed in inferring a target concept provided it has a (weak) k-neighbor in the
hypothesis space; otherwise it again has to refute the hypothesis space. Here a
(weak) k-neighbor is defined in terms of a distance over strings.

Jain et al. study several variations of learning refutably. Now, the target 
concepts are drawn from the set of all recursive functions and an acceptable pro- 
gramming system is given as hypothesis space. Thus, it is no longer appropriate 
to refute the whole hypothesis space, since it contains a correct description for 
every target. Nevertheless, the learner may not be able to solve its learning task. 
This can be indicated by either outputting the refuting symbol and stopping the 
learning process, or by converging to the refutation symbol, or by outputting 
the refutation symbol infinitely often. All these models of learning refutably are 
studied, related to one another as well as their combination with other, previ- 
ously studied learning models within the setting of inductive inference. 
Last but not least within this group of papers, Merkle and Stephan extend
the original notion of learning with refutation from positive data by introducing 
the notion of refutation in the limit and by considerably extending the notion of 
Text. Now, the data-sequences are sequences of first-order sentences describing 
the target. Several new and interesting hierarchies are then established. 

The papers in the last section study learnability of formal languages and 
structured data. Arimura et al. consider the problem of identifying tree patterns 
in marked data. The paper extends previous work by allowing gaps in the tree 
pattern, that is, internal sub-trees may be skipped when matching a tree pattern 
to an example tree. The paper shows that the task is solvable in polynomial time 
in an active learning setting where the learner can ask queries. 

Elementary formal systems are similar to logic programs operating on strings 
and having a distinguished unary predicate. The true atoms for this predicate 
define a formal language. Lange et al. extend work on learnability [2] of such
systems to allow negation in the logic program. The extension is done along 
the lines of stratified logic programs. Learnability is studied and compared to 
the case before the extension. In Gold’s paradigm some positive results do not 
transfer to the extended systems, but in the PAC model the main known positive
result is shown to hold for the extended systems. 

The problem of learning regular languages has been extensively studied with 
several representation schemes. Dennis et al. study learnability of regular lan- 
guages using a non-deterministic representation based on residuals — comple- 
tion languages for prefixes of words in the language. While the representation 
is shown not to be polynomially learnable, parameters of the representation are 
studied empirically and these suggest a new learning algorithm with desirable 
properties. Experiments show that the algorithm compares favorably with other 
systems. 

The class of Büchi automata defines languages over infinite strings. When
modeling learnability of such languages one is faced with the question of exam- 
ples of infinite size. The paper by de la Higuera and Janodet introduces a model 
of learning such languages from finite prefixes of examples. While the complete 
class is not learnable a sub-class is identified and shown learnable in the limit 
with polynomial update on each example. 

As the above descriptions of papers demonstrate, the range of learning prob- 
lems and issues addressed by the papers in this volume is rich and varied. While 
we have partitioned the papers mainly according to the techniques used, they 
could have been classified according to the classes of objects that are being 
learned. These include representations of formal languages, recursive functions, 
Boolean concepts over Boolean domains, real valued functions, neural networks, 
and kernel based SVM. Several of the papers also combine the theoretical study 
with an empirical investigation or validation of new ideas, and it would also be 
beneficial to classify them according to the types of relevant application areas. 

While the papers in this volume will surely not give an exhaustive list of 
problems addressed and types of theory developed in learning theory in general,
it is our hope that they will give an idea of where the field stands currently and
where it may be going in the future. 
Abstract. The Discovery Science project in Japan, in which more than
sixty scientists participated, was a three-year project sponsored by a
Grant-in-Aid for Scientific Research on Priority Area from the Ministry of
Education, Culture, Sports, Science and Technology (MEXT) of Japan. This
project mainly aimed to (1) develop new methods for knowledge discov- 
ery, (2) install network environments for knowledge discovery, and (3) 
establish Discovery Science as a new area of Computer Science / Artifi- 
cial Intelligence Study. 



In order to attain these aims we set up five groups for studying the following 
research areas: 

(A) Logic for/of Knowledge Discovery 

(B) Knowledge Discovery by Inference/Reasoning 

(C) Knowledge Discovery Based on Computational Learning Theory 

(D) Knowledge Discovery in Huge Database and Data Mining 

(E) Knowledge Discovery in Network Environments 

These research areas and related topics can be regarded as a preliminary def- 
inition of Discovery Science by enumeration. Thus Discovery Science ranges over 
philosophy, logic, reasoning, computational learning and system developments. 

In addition to these five research groups we organized a steering group for 
planning, adjustment and evaluation of the project. The steering group, chaired 
by the principal investigator of the project, consists of leaders of the five research 
groups and their subgroups as well as advisors from the outside of the project. 
We invited three scientists to consider Discovery Science as a whole, surveying the
above five research areas from the viewpoints of knowledge science, natural language
processing, and image processing, respectively.

The group A studied discovery from a very broad perspective, taking into
account historical and social aspects of discovery, and computational and logical
aspects of discovery. The group B focused on the role of inference/reasoning
in knowledge discovery, and obtained many results on both theory and practice 
on statistical abduction, inductive logic programming and inductive inference. 
The group C aimed to propose and develop computational models and method- 
ologies for knowledge discovery mainly based on computational learning theory. 
This group obtained some deep theoretical results on boosting of learning al- 
gorithms and the minimax strategy for Gaussian density estimation, and also 
methodologies specialized to concrete problems such as an algorithm for finding
best subsequence patterns, a biological sequence compression algorithm, text
categorization, and MDL-based compression. The group D aimed to create a
computational strategy for speeding up the discovery process as a whole. For this purpose,
the group D was organized with researchers working in scientific domains and 
researchers from computer science so that real issues in the discovery process 
can be exposed and practical computational techniques can be devised and
tested for solving these real issues. This group handled many kinds of data: data 
from national projects such as genomic data and satellite observations, data gen- 
erated from laboratory experiments, data collected from personal interests such 
as literature and medical records, data collected in business and marketing areas,
and data for proving the efficiency of algorithms such as the UCI repository.
Many theoretical and practical results were obtained on such a variety of data.
The group E aimed to develop a unified media system for knowledge discovery 
and network agents for knowledge discovery. This group obtained practical re- 
sults on a new virtual materialization of DB records and scientific computations 
that help scientists to make a scientific discovery, a convenient visualization in- 
terface that treats web data, and an efficient algorithm that extracts important 
information from semi-structured data in the web space. 

This lecture describes an outline of our project and the main results as well 
as how the project was prepared. We have published and are publishing special
issues on our project in several journals [5], [6], [7], [8], [9], [10]. As an activity
of the project we organized and sponsored the Discovery Science Conference for
three years where many papers were presented by our members [2], [3], [4]. We 
also published annual progress reports [1], which were distributed at the DS 
conferences. We are publishing the final technical report as an LNAI [11]. 
Abstract. We begin with a brief tutorial on the problem of learning a finite
concept class over a finite domain using membership queries and/or
equivalence queries. We then sketch general results on the number of 
queries needed to learn a class of concepts, focusing on the various no- 
tions of combinatorial dimension that have been employed, including the 
teaching dimension, the exclusion dimension, the extended teaching di- 
mension, the fingerprint dimension, the sample exclusion dimension, the 
Vapnik-Chervonenkis dimension, the abstract identification dimension, 
and the general dimension. 



1 Introduction 

Formal models of learning reflect a variety of differences in tasks, sources of 
information, prior knowledge and capabilities of the learner, and criteria of suc- 
cessful performance. In the model of exact identification with queries [T], the 
task is to identify an unknown concept drawn from a known concept class us- 
ing queries to gather information about the unknown concept. The two most 
studied types of queries are membership queries and equivalence queries. In a 
membership query, the learner asks if a particular domain element is included 
in the unknown concept or not. In an equivalence query, the learner proposes a 
particular concept, and is told either that the proposed concept is the same as 
the unknown concept, or is given a counterexample, that is, a domain element 
that is classified differently by the proposed concept and the unknown concept. 
If there are several possible counterexamples, the choice of which one to present 
is generally assumed to be made adversarially. 
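
The two query types can be sketched as a small oracle over a finite domain. The class and function names are hypothetical, and one simplification is made: the counterexample is chosen by a fixed rule (the first differing element in domain order) rather than adversarially.

```python
class QueryOracle:
    """Answers membership and equivalence queries about a hidden target concept."""

    def __init__(self, domain, target):
        self.domain = list(domain)
        self.target = frozenset(target)  # assumed to be a subset of the domain
        self.membership_count = 0
        self.equivalence_count = 0

    def membership(self, x):
        """Is the domain element x included in the unknown concept?"""
        self.membership_count += 1
        return x in self.target

    def equivalence(self, hypothesis):
        """Return None for "yes", otherwise a counterexample."""
        self.equivalence_count += 1
        difference = self.target ^ frozenset(hypothesis)
        for x in self.domain:
            if x in difference:
                return x  # classified differently by hypothesis and target
        return None


def learn_by_membership(oracle):
    """Naive exact identification: one membership query per domain element."""
    return {x for x in oracle.domain if oracle.membership(x)}


oracle = QueryOracle(["x1", "x2", "x3"], {"x1", "x3"})
counterexample = oracle.equivalence({"x1"})
learned = learn_by_membership(oracle)
```

The naive learner always succeeds but spends as many queries as there are domain elements; the counters make it easy to compare such a baseline with strategies that exploit the structure of the concept class.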

Researchers have invented a wonderful variety of ingenious and beautiful 
polynomial-time learning algorithms that use queries to achieve exact identifi- 
cation of different classes of concepts, as well as important modifications of the 
basic model to incorporate more realism, e.g., background knowledge and errors. 
However, this survey will focus on the question of how many queries are needed 
to learn different classes of concepts, ignoring other computational costs. The 
analogous question in the PAC model [19] is how many examples are needed to
learn different classes of concepts. In the case of the PAC model, bounds in terms
of a combinatorial property of the concept class called the Vapnik-Chervonenkis
dimension provided a satisfying answer early on [7, 8]. In the case of learning
with queries, the development has been both more gradual and more variegated.



2 Preliminaries 

The domain X is a nonempty finite set. A concept is any subset of X, and 
a concept class is any nonempty set of concepts. We ignore the issues of how 
concepts and domain elements are represented. We distinguish certain useful 
concept classes: the class 2^X of all subsets of X, and the class S(X) of singleton
subsets of X. We also define S+(X), the class S(X) together with the empty
set.

One way to visualize a domain X and a concept class C is as a binary matrix 
whose rows are indexed by the concepts, say c_1, c_2, ..., c_m, and whose columns
are indexed by the elements of X, say x_1, x_2, ..., x_n, and whose (i, j) entry is
1 if x_j ∈ c_i and 0 otherwise. An example is given in Figure 1.



        x_1   x_2   x_3
c_1      1     0     1
c_2      0     0     1
c_3      1     1     0
c_4      1     0     0

Fig. 1. Matrix representation of the concept class C_0 = {c_1, c_2, c_3, c_4}, where
c_1 = {x_1, x_3}, c_2 = {x_3}, c_3 = {x_1, x_2}, c_4 = {x_1}

The rows (representing concepts) are all distinct, though the columns need 
not be. For our purposes the columns (representing domain elements) may also 
be assumed to be distinct, because there is no point in distinguishing between 
elements x and x' that are contained in exactly the same set of concepts. Thus, 
a domain and concept class can be represented simply as a finite binary relation 
whose rows are distinct and whose columns are distinct. This makes clear the 
symmetry of the roles of the domain and the concept class. 

For any concept c ⊆ X we define two basic types of queries with respect to
c. In a membership query, the input is an element x ∈ X, and the output is 1 if
x ∈ c and 0 if x ∉ c. In an equivalence query, the input is a concept c' ⊆ X, and
the output is either “yes,” if c' = c, or an element x in the symmetric difference
of c and c', if c' ≠ c. Such an element x is a counterexample. The choice of a
counterexample is nondeterministic.

A learning problem is specified by giving the domain X, the class of concepts
C, and the permitted types of queries. The task of a learning algorithm is
to identify an unknown concept c drawn from C using the permitted types of
queries. Because we ignore computational resources other than the number of
queries, we use decision trees to model learning algorithms.

A learning algorithm over X is a finite rooted tree that may have two types
of internal nodes. A membership query node is labelled by an element x ∈ X
and has two outgoing edges, labelled by 0 and 1. An equivalence query node is
labelled by a concept c ⊆ X and has |X| + 1 outgoing edges, labelled by “yes”
and the elements of X. The leaf nodes are unlabelled. An example of a learning
algorithm T_0 that uses only membership queries is given in Figure 2.




Fig. 2. MQ-algorithm T_0 over domain X = {x_1, x_2, x_3}



Given a learning algorithm T and a concept class C, we recursively define 
the evaluation of T on C as follows. Each node of T will be assigned the subset 
of C consistent with the answers to queries along the path from the root to that 
node. 

The root node is assigned C itself. Suppose an internal node v has been
assigned the subset C' of C. If v is a membership query labelled by x, then the
0-child of v is assigned the subset of C' consisting of concepts c such that x ∉ c,
and the 1-child of v is assigned the subset of C' consisting of concepts c such
that x ∈ c. In this case, the set C' is partitioned between the two children of v. If
v is an equivalence query labelled by c', then for each x ∈ X, the x-child of v is
assigned the subset of C' consisting of concepts c such that x is in the symmetric
difference of c' and c. The “yes”-child of v is assigned the singleton {c'} if c' ∈ C',
otherwise it is assigned the empty set. In this case, we do not necessarily have
a partition; a concept in C' may be assigned to several of the children of v. The
assignment produced by evaluation of the tree T_0 on the concept class C_0 is
shown in Figure 3.
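The assignment rule for a single node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `X`, `C0`, `mq_children`, and `eq_children` are hypothetical, and concepts are represented as frozensets over the domain of Fig. 1.

```python
# Running example C_0 from Fig. 1; all identifiers here are illustrative.
X = ("x1", "x2", "x3")
C0 = {
    "c1": frozenset({"x1", "x3"}),
    "c2": frozenset({"x3"}),
    "c3": frozenset({"x1", "x2"}),
    "c4": frozenset({"x1"}),
}

def mq_children(assigned, x):
    """Split the concepts assigned to a membership-query node labelled x."""
    zero = {name for name in assigned if x not in C0[name]}
    one = {name for name in assigned if x in C0[name]}
    return {0: zero, 1: one}

def eq_children(assigned, hypothesis):
    """Assign concepts to the children of an equivalence-query node."""
    # The "yes"-child holds the hypothesis itself, if it is among the assigned concepts.
    children = {"yes": {name for name in assigned if C0[name] == hypothesis}}
    for x in X:
        # A concept goes to the x-child when x lies in the symmetric difference.
        children[x] = {name for name in assigned if x in (C0[name] ^ hypothesis)}
    return children

if __name__ == "__main__":
    print(mq_children(set(C0), "x3"))
    print(eq_children(set(C0), C0["c4"]))
```

A membership query always yields two disjoint children, whereas the children of an equivalence query may overlap, which is the non-partition behaviour described above.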

A learning algorithm T is successful for a class of concepts C if in the
evaluation of T on C, there is no leaf ℓ of T such that two distinct concepts c, c' ∈ C
are assigned to ℓ. This implies that T has at least |C| leaves, because in the
evaluation of T on C each element of C is assigned to at least one leaf, and no two
elements of C are assigned to the same leaf of T. It also implies that the decision
tree T may be used to identify an unknown concept c ∈ C by asking queries
starting with the root and following the edges corresponding to the answers
until a leaf is reached, at which point exactly one concept c ∈ C is consistent
with the answers received.

Fig. 3. Assignment produced by evaluation of T_0 from Figure 2 on C_0

Let T be a learning algorithm over X. The depth of T, denoted d(T), is the
maximum number of edges in any path from the root to a leaf of T. Let c ⊆ X
be any concept. The depth of c in T, denoted d(c, T), is the maximum number
of edges in a path from the root to any leaf assigned c in the evaluation of T on
the class {c}. This is the worst-case number of queries used by the algorithm T
in identifying c. Figure 3 shows that T_0 is successful for C_0, and d(c_4, T_0) = 3.

3 Membership Queries Only 

A MQ-algorithm uses only membership queries. The partition property of mem- 
bership queries implies that every concept is assigned to just one leaf of a MQ- 
algorithm. If a MQ-algorithm T is successful for a concept class C, then 

$$\log |C| \le d(T), \tag{1}$$

because T is a binary tree with at least |C| leaves. 

Let T_MQ(C) denote the set of MQ-algorithms T that are successful for C, and
have no leaf assigned ∅ in the evaluation on C. To see that T_MQ(C) is nonempty,
consider the exhaustive MQ-algorithm that systematically queries every element
of X in turn. Certainly, no two concepts are assigned to the same leaf, although
some leaves may be assigned ∅. If so, redundant queries may be pruned until every
leaf is assigned exactly one concept from C. This MQ-algorithm is successful for
every concept class over X. Its depth is at most |X|.

Define the MQ-cost of a class C of concepts over X, denoted #MQ(C), as

$$\#\mathrm{MQ}(C) = \min_{T \in \mathcal{T}_{MQ}(C)} \max_{c \in C} d(c, T). \tag{2}$$

Then

$$\log |C| \le \#\mathrm{MQ}(C) \le |X|, \tag{3}$$

because any MQ-algorithm successful on C has depth at least log |C|, and the
exhaustive MQ-algorithm has depth |X|. For the class 2^X, the upper and lower
bounds are equal. MQ-algorithms are equivalent to the mistake trees of
Littlestone [15].
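For very small classes, the MQ-cost can be computed directly from its min-max definition. The following is a brute-force sketch, not an efficient procedure; the function name `mq_cost` and the representation of concepts as frozensets are assumptions made for illustration.

```python
from functools import lru_cache

def mq_cost(concepts, domain):
    """Brute-force #MQ(C): minimum worst-case number of membership queries."""
    domain = tuple(domain)

    @lru_cache(maxsize=None)
    def cost(remaining):
        if len(remaining) <= 1:
            return 0  # at most one consistent concept: nothing left to ask
        best = None
        for x in domain:
            inside = frozenset(c for c in remaining if x in c)
            outside = remaining - inside
            if not inside or not outside:
                continue  # querying x would not split the remaining concepts
            worst = 1 + max(cost(inside), cost(outside))
            best = worst if best is None else min(best, worst)
        return best

    return cost(frozenset(frozenset(c) for c in concepts))

if __name__ == "__main__":
    C0 = [{"x1", "x3"}, {"x3"}, {"x1", "x2"}, {"x1"}]
    print(mq_cost(C0, ("x1", "x2", "x3")))
```

The recursion skips queries that do not split the remaining concepts, which mirrors the pruning of redundant queries in the exhaustive MQ-algorithm.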
4 Equivalence Queries Only

Consider the learning algorithm T_1, which uses only equivalence queries over X,
presented in Figure 4. The evaluation of T_1 on the concept class C_0 is presented
in Figure 5.




Fig. 4. Equivalence query algorithm T_1

Fig. 5. Evaluation of T_1 on C_0



The algorithm T_1 is successful for C_0, because no two concepts from C_0 are
assigned to the same leaf of T_1. The concept c_2 is assigned to two different leaves
of T_1, illustrating the non-partition property of equivalence queries.

Given a concept class C, a proper equivalence query with respect to C is
an equivalence query that uses an element c ∈ C. We use the notation EQ
for equivalence queries proper with respect to a class C, and XEQ for extended
equivalence queries, which are unrestricted. A useful generalization allows
equivalence queries from a hypothesis class H containing C, but for simplicity we do
not pursue that option. The equivalence queries in T_1 involve only concepts that
are elements of C_0, namely c_4 = {x_1} and c_1 = {x_1, x_3}. Consequently, we say
that T_1 is an EQ-algorithm for C_0.

Given a concept class C, let T_EQ(C) denote the set of EQ-algorithms
successful for C, and let T_XEQ(C) denote the set of XEQ-algorithms successful for
C. Clearly, T_EQ(C) ⊆ T_XEQ(C). To see that T_EQ(C) is nonempty, consider the
exhaustive EQ-algorithm for C, which consists of making an equivalence query
with every element of C, except one, in some order. This gives an EQ-algorithm
of depth |C| - 1 that is successful for C.

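Define

$$\#\mathrm{EQ}(C) = \min_{T \in \mathcal{T}_{EQ}(C)} \max_{c \in C} d(c, T) \tag{4}$$

and

$$\#\mathrm{XEQ}(C) = \min_{T \in \mathcal{T}_{XEQ}(C)} \max_{c \in C} d(c, T). \tag{5}$$

For every concept class C,

$$\#\mathrm{XEQ}(C) \le \#\mathrm{EQ}(C) \le |C| - 1. \tag{6}$$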

For the class of singletons, S(X), a simple adversary argument shows that
#EQ(S(X)) = |X| - 1, attaining the upper bound above. For the same class,
a single XEQ with the empty set discloses the identity of the target concept;
therefore, #XEQ(S(X)) = 1.
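The gap between proper and extended equivalence queries on the singletons can be checked by exhaustive search on a tiny domain. The sketch below assumes an adversarial choice of counterexample and uses hypothetical names (`eq_cost`, `powerset`); it enumerates all hypotheses, so it is only usable for very small X.

```python
from functools import lru_cache
from itertools import combinations

def powerset(domain):
    """All subsets of the domain, as frozensets."""
    items = tuple(domain)
    return [frozenset(s) for r in range(len(items) + 1)
            for s in combinations(items, r)]

def eq_cost(concepts, domain, proper=True):
    """Brute-force #EQ(C) (proper=True) or #XEQ(C) (proper=False)."""
    domain = tuple(domain)
    concept_class = frozenset(frozenset(c) for c in concepts)
    hypotheses = list(concept_class) if proper else powerset(domain)

    @lru_cache(maxsize=None)
    def cost(remaining):
        if len(remaining) <= 1:
            return 0
        best = None
        for h in hypotheses:
            worst, useless = 0, False
            for x in domain:
                # Concepts that survive the counterexample x to hypothesis h.
                child = frozenset(c for c in remaining if (x in c) != (x in h))
                if child == remaining:
                    useless = True  # the adversary could answer x and reveal nothing
                    break
                worst = max(worst, cost(child))
            if not useless:
                total = 1 + worst
                best = total if best is None else min(best, total)
        return best

    return cost(concept_class)

if __name__ == "__main__":
    singletons = [{1}, {2}, {3}]
    print(eq_cost(singletons, (1, 2, 3), proper=True))
    print(eq_cost(singletons, (1, 2, 3), proper=False))
```

A “yes” answer ends the search at cost zero, so only the counterexample children contribute to the worst case.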



5 Membership and Equivalence Queries 

Algorithms may involve both membership and equivalence queries. We distin- 
guish MQ&EQ-algorithms, in which all the equivalence queries are proper for the 
concept class under consideration, from MQ&XEQ-algorithms, in which there 
is no restriction on the equivalence queries. Figure 6 shows T_2, a
MQ&EQ-algorithm that is successful for the class C_0.




Fig. 6. MQ&EQ-algorithm T_2 for the concept class C_0



Let T_MQ&EQ(C) denote the set of MQ&EQ-algorithms successful for the
concept class C. Define

$$\#\mathrm{MQ\&EQ}(C) = \min_{T \in \mathcal{T}_{MQ\&EQ}(C)} \max_{c \in C} d(c, T). \tag{7}$$

For any concept class C, because MQ&EQ-algorithms have both types of
queries available, we have the following inequalities.

$$\#\mathrm{MQ\&EQ}(C) \le \#\mathrm{MQ}(C), \tag{8}$$

and

$$\#\mathrm{MQ\&EQ}(C) \le \#\mathrm{EQ}(C). \tag{9}$$

For the concept class C, let T_MQ&XEQ(C) denote the set of
MQ&XEQ-algorithms successful for C. Define

$$\#\mathrm{MQ\&XEQ}(C) = \min_{T \in \mathcal{T}_{MQ\&XEQ}(C)} \max_{c \in C} d(c, T). \tag{10}$$

Clearly,

$$\#\mathrm{MQ\&XEQ}(C) \le \#\mathrm{MQ\&EQ}(C). \tag{11}$$

6 XEQ’s, Majority Vote, and Halving 

The first query in the algorithm T_1 is very productive, in the sense that no child
of the root is assigned more than half the concepts in C_0. The existence of such
a productive query is fortuitous in the case of EQ’s, but is guaranteed in the
case of XEQ’s. In particular, for any class C of concepts over X, we define the
majority vote of C, denoted c_m(C), as follows.

$$(\forall x \in X)\,\bigl[\, x \in c_m(C) \iff |\{c' \in C : x \in c'\}| > |C|/2 \,\bigr]. \tag{12}$$

That is, an element x is placed in c_m(C) if and only if more than half the
concepts in C contain x. Thus, any counterexample to the majority vote concept
eliminates at least half the possible concepts. The majority vote concept for C_0
is {x_1} = c_4.

The halving algorithm for C may be described as follows. Starting with the
root, construct the tree of XEQ’s and the evaluation of the tree on C
concurrently. If there is a leaf assigned C' and C' has cardinality more than 1, then
extend the tree and its evaluation on C by replacing the leaf with an XEQ
labelled by the majority vote concept, c_m(C'). Because the set of concepts assigned
to a node can be no more than half of the concepts assigned its parent, no path
in the tree can contain more than ⌊log |C|⌋ XEQ’s. Thus, for any concept class
C,

$$\#\mathrm{XEQ}(C) \le \lfloor \log |C| \rfloor. \tag{13}$$

The halving algorithm is good, but not necessarily optimal USES]. 
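As an illustration, the halving algorithm can be simulated against a known target. This is a minimal sketch, assuming the target belongs to the class and that the oracle returns the first element of the symmetric difference in sorted order (an adversary could choose differently); the function names are hypothetical.

```python
def majority_vote(concepts, domain):
    """Elements contained in strictly more than half of the concepts."""
    return frozenset(x for x in domain
                     if 2 * sum(x in c for c in concepts) > len(concepts))

def halving(concepts, domain, target):
    """Identify target with XEQ's on majority-vote hypotheses.

    Returns the identified concept and the number of XEQ's asked.
    Assumes target is one of the given concepts.
    """
    remaining = {frozenset(c) for c in concepts}
    target = frozenset(target)
    queries = 0
    while len(remaining) > 1:
        hypothesis = majority_vote(remaining, domain)
        queries += 1
        difference = sorted(hypothesis ^ target)
        if not difference:
            return hypothesis, queries  # the oracle answered "yes"
        x = difference[0]  # one possible counterexample
        # Keep only the concepts that classify x differently from the hypothesis.
        remaining = {c for c in remaining if (x in c) != (x in hypothesis)}
    (found,) = remaining
    return found, queries

if __name__ == "__main__":
    X = ("x1", "x2", "x3")
    C0 = [{"x1", "x3"}, {"x3"}, {"x1", "x2"}, {"x1"}]
    for c in C0:
        print(sorted(c), halving(C0, X, c))
```

Each counterexample leaves at most half of the remaining concepts, so the loop runs at most ⌊log |C|⌋ times regardless of which counterexample is returned.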

7 An Optimal XEQ-Algorithm

Littlestone [15] defines the standard optimal algorithm, which achieves an
XEQ-algorithm of depth #XEQ(C) for any concept class C. He proves the non-obvious
result that

$$\#\mathrm{XEQ}(C) = \max_{T \in \mathcal{T}_{MQ}(C)} \min_{c \in C} d(c, T). \tag{14}$$

That is, the optimal number of XEQ’s to learn a class C is the largest d such
that there is a MQ-algorithm successful for C in which the depth of each leaf is
at least d.






Maass and Turán [16] use this result to show that

$$\#\mathrm{XEQ}(C) / \log(\#\mathrm{XEQ}(C) + 1) \le \#\mathrm{MQ\&XEQ}(C). \tag{15}$$

This shows that the addition of MQ’s cannot produce too much of an
improvement over XEQ’s alone.

To prove this, let d = #XEQ(C) and consider a MQ-algorithm T successful
for C such that every leaf is at depth at least d. Let V denote the set of nodes
v at depth d in T such that a concept c ∈ C assigned to a descendant of v is
consistent with all the replies to queries so far. Initially, V contains 2^d nodes.

An adversary answers MQ’s and XEQ’s so as to preserve at least a fraction
1/(d + 1) of V as follows. For a membership query with element x, if at least
half the current elements would be preserved by the answer 1, then answer 1,
else answer 0. For an equivalence query with the concept c' (not necessarily in
C), consider the node v at depth d in T that c' is assigned to. If we consider the
d nodes that are siblings of nodes along the path from the root to v, at least one
of them, say v', must account for a fraction of at least 1/(d + 1) of the current
elements of V. If the label on the parent of v' is x, then answer the equivalence
query with x. Thus, after j queries, there are at least 2^d/(d + 1)^j elements left
in V. Thus, the adversary forces at least d/log(d + 1) queries.
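The last step of the argument can be spelled out as follows, assuming logarithms to base 2 (as the count 2^d suggests) and reading "the learner cannot stop" as "more than one node of V remains":

```latex
\frac{2^{d}}{(d+1)^{j}} > 1
\;\iff\; d > j \log (d+1)
\;\iff\; j < \frac{d}{\log (d+1)} .
```

So as long as fewer than d/log(d + 1) queries have been asked, the lower bound on the size of V exceeds 1 and the target is not yet determined.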

8 Dimensions of Exact Learning 

In this section we consider some of the dimensions introduced to bound the cost 
of learning a concept class with various combinations of queries. For some of 
these we have suggested different names, to try to bring out the relationships 
between these definitions. 



8.1 The Teaching Dimension 

Given a concept class C and a concept c ∈ C, a teaching set for c with respect to
C is a set S ⊆ X such that no other concept in C classifies all the examples in S
the same way c does. For example, {x_1, x_3} is a teaching set for the concept c_1
with respect to C_0 because no other concept in C_0 contains both elements. Also,
{x_1} is a teaching set for the concept c_2 with respect to C_0 because every other
concept in C_0 contains x_1. If a learner is presented with an unknown concept
from C, by making membership queries for each of the elements in a teaching
set for c, the learner can verify whether or not the unknown concept is c.

The teaching dimension of a concept class C, denoted TD(C), is the
maximum over all c ∈ C of the minimum size of a teaching set for c with respect to C
[10, 18]. It is the worst case number of examples a teacher might have to present
to a learner of a concept c ∈ C to eliminate all other possible concepts in C.

Examples. The teaching dimension of C_0 is 2. The teaching dimension of
S(X), the set of singletons over X, is 1 because each set contains an element
unique to that set. However, the teaching dimension of S+(X), the singletons
together with the empty set, is |X|, because the only teaching set for the empty
set in this situation is X itself.

In terms of MQ-algorithms, we have

$$\mathrm{TD}(C) = \max_{c \in C} \min_{T \in \mathcal{T}_{MQ}(C)} d(c, T). \tag{16}$$

This is true because the labels on any path from the root to a leaf assigned c in
a MQ-algorithm successful for C constitute a teaching set for c with respect to
C; any other concept in C must disagree with c on at least one of them, or it
would have been assigned to the same leaf as c. Conversely, given a teaching set
S for c with respect to C, we can construct a MQ-algorithm that asks queries
for those elements first, stopping if the answers are those for c, and continuing
exhaustively otherwise. This will produce a MQ-algorithm successful for C in
which c is assigned to a leaf at depth |S|. Thus the minimization finds the size
of the smallest teaching set for a given c, and this is maximized over c ∈ C.

Note that the max and min operations are exchanged in the two equations
(2) and (16), and therefore by the properties of max and min,

$$\mathrm{TD}(C) \le \#\mathrm{MQ}(C). \tag{17}$$

(Note that (2), (14), and (16) involve three out of the four possible combinations
of max, min, c ∈ C, and T ∈ T_MQ(C).)
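The max-min step behind the bound on TD(C) can be written out explicitly. The symbols c_0 and T_0 are introduced here only for the derivation; they denote an arbitrary concept in C and an arbitrary MQ-algorithm in T_MQ(C).

```latex
\min_{T \in \mathcal{T}_{MQ}(C)} d(c_0, T)
\;\le\; d(c_0, T_0)
\;\le\; \max_{c \in C} d(c, T_0)
\qquad \text{for all } c_0 \in C,\ T_0 \in \mathcal{T}_{MQ}(C).
```

Taking the maximum over c_0 on the left and the minimum over T_0 on the right gives max-min ≤ min-max, that is, TD(C) ≤ #MQ(C).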

8.2 The Exclusion Dimension 

The teaching dimension puts a lower bound on the number of examples a teacher 
may need to convince a skeptical student of the identity of a concept in C. What
about concepts not in C? For a concept c' ∉ C, how many examples does it
take to prove that the concept is not in C? For technical reasons, we consider a
slightly different notion, namely, the number of examples to reduce to at most
one the set of concepts in C that agree with c' on the examples.

If C is a class of concepts and c' an arbitrary concept, then a specifying set
for c' with respect to C is a set S of examples such that at most one concept
c ∈ C agrees with the classification of c' for all the elements of S. If c' ∈ C, then
a specifying set for c' with respect to C is just a teaching set for c' with respect
to C.

Suppose c' ∉ C and suppose S is a specifying set for c' with respect to C.
There are two possibilities: either there is no concept c ∈ C that agrees with the
classification of c' for every example in S, or there is exactly one such concept
c ∈ C. If there is one such, say c, we can add to S a single example on which c'
and c disagree to construct a set S' such that no concept in C agrees with the
classification of c' on every example in S'. Thus, a specifying set may require at
most one more example to become a “proof” that c' ∉ C.

Define the exclusion dimension, denoted XD(C), of a concept class C as the
maximum over all concepts c' ∉ C, of the minimum size of any specifying set
for c' with respect to C. If C = 2^X, define the exclusion dimension of C to be 0.
This is the same as the unique specification dimension of Hegedüs [12] and the
certificate size of Hellerstein et al. [13].

Examples. XD(S(X)) = |X| - 1 because for the empty set we must specify
|X| - 1 examples as not belonging to the empty set to reduce the possible concepts
to at most one (the singleton containing the element not specified). However,
for |X| ≥ 2, XD(S+(X)) = 1 because any concept not in S+(X) contains at
least two elements, and specifying that one of them belongs to the concept is
enough to rule out the empty set and all but one singleton subset of X. We
have XD(C_0) = 1, because each of the concepts not in C_0 has a specifying set of
size 1. For example, the empty set has a specifying set {x_1} with respect to C_0,
because only c_2 also does not include x_1, and the set {x_1, x_2, x_3} has a specifying
set {x_2} with respect to C_0, because only c_3 also includes x_2.

The argument for (16) generalizes to give

$$\mathrm{XD}(C) = \max_{c' \notin C} \min_{T \in \mathcal{T}_{MQ}(C)} d(c', T). \tag{18}$$

Let T be any MQ-algorithm that is successful for C. Consider any concept
c' ∉ C, the leaf ℓ of T that c' is assigned to, and the set S of elements queried
on the path from the root to ℓ. Because at most one element of C is assigned to
ℓ, S is a specifying set for c'.

Conversely, if c' ∉ C and S is a specifying set for c', then we may construct
a MQ-algorithm successful for C by querying the elements of S. If an answer
disagrees with the classification by c', then continue with the exhaustive MQ
algorithm. If the answers for all the elements of S agree with the classifications
by c', then there is at most one concept in C consistent with those answers, and
the algorithm may halt.

Hence, the smallest specifying set for c' has size equal to the minimum depth
of c' in any MQ-tree successful for C, and (18) follows.

Also, for any concept class C,

$$\mathrm{XD}(C) \le \#\mathrm{MQ}(C). \tag{19}$$

Consider any MQ-tree T of depth #MQ(C) that is successful for C. Every
c' ∉ C has a specifying set consisting of the elements queried along the path in
T that c' is assigned to, which is therefore of size at most #MQ(C).

8.3 The Extended Teaching Dimension 

The combination of the teaching dimension and the exclusion dimension yields
the extended teaching dimension [12]. The extended teaching dimension of a
concept class C, denoted XTD(C), is the maximum over all concepts c' ⊆ X of
the minimum size of any specifying set for c' with respect to C. Clearly, for any
concept class C,

$$\mathrm{XTD}(C) = \max\{\mathrm{TD}(C), \mathrm{XD}(C)\}. \tag{20}$$

From (16) and (18) we have

$$\mathrm{XTD}(C) = \max_{c \in 2^X} \min_{T \in \mathcal{T}_{MQ}(C)} d(c, T). \tag{21}$$

From (17) and (19), we have

$$\mathrm{XTD}(C) \le \#\mathrm{MQ}(C). \tag{22}$$

Examples. XTD(C_0) = 2 = max{2, 1}. If |X| ≥ 2, XTD(S(X)) = |X| - 1
and XTD(S+(X)) = |X|.
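The three dimensions can be checked on small examples by searching for minimum specifying sets directly. The sketch below is exhaustive and only meant for tiny domains; `min_specifying_size`, `dimensions`, and `powerset` are hypothetical names, and concepts are assumed to be distinct subsets of the domain.

```python
from itertools import combinations

def powerset(domain):
    """All subsets of the domain, as frozensets."""
    items = tuple(domain)
    return [frozenset(s) for r in range(len(items) + 1)
            for s in combinations(items, r)]

def min_specifying_size(target, concepts, domain):
    """Size of a smallest S on which at most one concept agrees with target."""
    domain = tuple(domain)
    for size in range(len(domain) + 1):
        for subset in combinations(domain, size):
            agreeing = [c for c in concepts
                        if all((x in c) == (x in target) for x in subset)]
            if len(agreeing) <= 1:
                return size
    raise ValueError("concepts must be distinct subsets of the domain")

def dimensions(concepts, domain):
    """Return (TD, XD, XTD) of the class by exhaustive search."""
    concepts = {frozenset(c) for c in concepts}
    inside = [min_specifying_size(c, concepts, domain) for c in concepts]
    outside = [min_specifying_size(c, concepts, domain)
               for c in powerset(domain) if c not in concepts]
    td = max(inside)
    xd = max(outside, default=0)  # XD(2^X) is defined to be 0
    return td, xd, max(td, xd)

if __name__ == "__main__":
    C0 = [{"x1", "x3"}, {"x3"}, {"x1", "x2"}, {"x1"}]
    print(dimensions(C0, ("x1", "x2", "x3")))
```

For a concept inside the class, "at most one agreeing concept" means only the concept itself agrees, so the same search yields teaching sets.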

9 The Testing Perspective 

In the simplest testing framework there is an unknown item, for example, a 
disease, and a number of possible binary tests to perform to try to identify 
the unknown item. There is a finite binary relation between the possible items 
and the possible tests; performing a test on the unknown item is analogous 
to a membership query, and adaptive testing algorithms correspond to MQ- 
algorithms. Hence the applicability of Moshkov’s results on testing to questions 
about MQ-algorithms. The frameworks are not completely parallel. Moshkov 
introduces the analog of equivalence queries for the testing framework [17].

We take a brief excursion to consider the computational difficulty of the 
problem of constructing an optimal testing algorithm (or, equivalently,
MQ-algorithm). There is a natural (and expensive) dynamic programming method
for constructing an optimal MQ-algorithm. Hyafil and Rivest show that it is 
NP-complete to decide, given a binary relation and a depth bound, whether 
the relation has a MQ-algorithm with at most that depth [14]. Arkin et al. [3]
consider this problem in the context of the number of probes needed to determine 
which one of a finite set of geometric figures is present in an image. They prove 
an approximation result for the natural (and efficient) greedy algorithm for this 
problem, which we now describe. 

An MQ-algorithm and its evaluation on C are constructed top-down and 
simultaneously. For each leaf node assigned more than one concept from C, 
choose a membership query that partitions the set of concepts assigned to the 
node as evenly as possible, and extend the tree and its evaluation until every leaf 
node is assigned exactly one concept from C . Arkin et al. show that this method 
achieves a tree whose height is within a factor of [log |C|] of the optimal height. 
(This greedy tree-construction method is a standard one in the literature of 
constructing decision trees from given example classifications, although decision 
trees compute classifications rather than identifications.) 
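The greedy construction can be sketched as a recursive procedure that reports only the depth of the resulting tree. This is an illustrative reading of the method described above, not the implementation of Arkin et al.; ties between equally balanced queries are broken by domain order, which is an arbitrary choice, and the function name is hypothetical.

```python
def greedy_mq_depth(concepts, domain):
    """Depth of the MQ-tree built by always asking the most evenly splitting query."""
    remaining = frozenset(frozenset(c) for c in concepts)
    if len(remaining) <= 1:
        return 0

    def imbalance(x):
        # 0 for a perfect split, len(remaining) for a query that splits nothing.
        inside = sum(x in c for c in remaining)
        return abs(2 * inside - len(remaining))

    x = min(domain, key=imbalance)
    inside = frozenset(c for c in remaining if x in c)
    outside = remaining - inside
    return 1 + max(greedy_mq_depth(inside, domain),
                   greedy_mq_depth(outside, domain))

if __name__ == "__main__":
    C0 = [{"x1", "x3"}, {"x3"}, {"x1", "x2"}, {"x1"}]
    print(greedy_mq_depth(C0, ("x1", "x2", "x3")))
```

Because distinct concepts always differ on some element, the most balanced query splits the remaining set and the recursion terminates.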

10 XTD and MQ-Algorithms 

Using a specifying set S for a concept c', we can replace an equivalence query 
with c' by a sequence of membership queries with the elements of S as follows. 
If a membership query with x gives an answer different from the classification
by c', we proceed as though the equivalence query received counterexample x in
reply. If the answers for all the elements of S are the same as the classifications
by c', then at most one element of C is consistent with all these answers, and
the learning algorithm can safely halt.

If we apply this basic method to replace each XEQ of the halving algorithm 
by a sequence of at most XTD(C) MQ’s, we get the following for any concept 
class C. 

$$\#\mathrm{MQ}(C) \le \mathrm{XTD}(C) \cdot \lfloor \log |C| \rfloor. \tag{23}$$
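The simulation of the halving algorithm by membership queries can be sketched as follows. This is a minimal illustration under several assumptions: the target belongs to the class, specifying sets are found by exhaustive search with respect to the whole class, and all function names are hypothetical.

```python
from itertools import combinations

def majority_vote(concepts, domain):
    """Elements contained in strictly more than half of the concepts."""
    return frozenset(x for x in domain
                     if 2 * sum(x in c for c in concepts) > len(concepts))

def specifying_set(hypothesis, concepts, domain):
    """A smallest S on which at most one concept agrees with the hypothesis."""
    for size in range(len(domain) + 1):
        for subset in combinations(domain, size):
            agreeing = [c for c in concepts
                        if all((x in c) == (x in hypothesis) for x in subset)]
            if len(agreeing) <= 1:
                return subset
    raise ValueError("concepts must be distinct subsets of the domain")

def learn_with_mqs(concepts, domain, target):
    """Halving algorithm with each XEQ replaced by MQ's on a specifying set."""
    domain = tuple(domain)
    concept_class = {frozenset(c) for c in concepts}
    remaining = set(concept_class)
    target = frozenset(target)
    queries = 0
    while len(remaining) > 1:
        hypothesis = majority_vote(remaining, domain)
        chosen = specifying_set(hypothesis, concept_class, domain)
        for x in chosen:
            queries += 1
            answer = x in target  # the membership query
            if answer != (x in hypothesis):
                # x plays the role of a counterexample to the hypothesis.
                remaining = {c for c in remaining if (x in c) == answer}
                break
        else:
            # Every answer matched the hypothesis: at most one concept is left.
            remaining = {c for c in remaining
                         if all((y in c) == (y in hypothesis) for y in chosen)}
    (found,) = remaining
    return found, queries

if __name__ == "__main__":
    X = ("x1", "x2", "x3")
    C0 = [{"x1", "x3"}, {"x3"}, {"x1", "x2"}, {"x1"}]
    for c in C0:
        print(sorted(c), learn_with_mqs(C0, X, c))
```

Each simulated XEQ costs at most XTD(C) membership queries, and at most ⌊log |C|⌋ of them are simulated, which is where the product in the bound comes from.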

We could instead replace each XEQ in the standard optimal algorithm by a
sequence of at most XTD(C) MQ’s to obtain

$$\#\mathrm{MQ}(C) \le \mathrm{XTD}(C) \cdot \#\mathrm{XEQ}(C). \tag{24}$$

Hegedüs [12] gives an improvement over (23), achieved by an algorithm with
a greedy ordering of the MQ’s used in the simulation of one XEQ.

$$\#\mathrm{MQ}(C) \le \bigl(2\,\mathrm{XTD}(C) / \log \mathrm{XTD}(C)\bigr) \cdot \lfloor \log |C| \rfloor. \tag{25}$$

He also gives an example of a family of concept classes for which this improved
bound is asymptotically tight.

These results give a reasonably satisfying characterization of the number of 
membership queries needed to learn a concept class C in terms of a combinatorial 
parameter of the class, the extended teaching dimension, XTD(C). The factor of
roughly log |C| difference between the lower bound and the upper bound may be
thought of as tolerably small, being the number of bits needed to name all the
concepts in C. Analogous results are achievable for algorithms that use MQ’s 
and EQ’s and for algorithms that use EQ’s alone. 

11 XD and MQ&EQ-Algorithms

Generalizing Moshkov’s results, Hegedüs [12] bounds the number of MQ’s and
EQ’s needed to learn a concept class in terms of the exclusion dimension.
Independently, Hellerstein et al. [13] introduce the idea of polynomial certificates to
characterize learnability with a polynomial number of MQ’s and EQ’s.

For any concept class C,

$$\mathrm{XD}(C) \le \#\mathrm{MQ\&EQ}(C) \le \mathrm{XD}(C) \cdot \lfloor \log |C| \rfloor. \tag{26}$$

An adversary argument establishes the lower bound. Let c' ∉ C be any concept
such that the minimum specifying set for c' has size d = XD(C). An
adversary can answer any sequence of at most (d - 1) MQ’s and EQ’s as though
the target concept were c'. (Note that because EQ’s must use concepts in C,
there cannot be an equivalence query with c' itself.) At this point, there must
be at least two concepts in C consistent with the answers given, so a successful
learning algorithm must ask at least one more query.

The upper bound is established by a simulation of the halving algorithm. If
an XEQ is made with concept c', then if c' ∈ C, it is already an EQ and need
not be replaced. If c' ∉ C, then we take a minimum specifying set S for c' with
respect to C and replace the XEQ by MQ’s about the elements of S, as described
in Section 10.

Using the standard optimal algorithm instead of the halving algorithm gives 
the following. 

$$\#\mathrm{MQ\&EQ}(C) \le \mathrm{XD}(C) \cdot \#\mathrm{XEQ}(C). \tag{27}$$

Again Hegedüs improves the upper bound of (26) by making a more careful
choice of the ordering of MQ’s, and gives an example of a family of classes for
which the improved bound is asymptotically tight.

$$\#\mathrm{MQ\&EQ}(C) \le \bigl(2\,\mathrm{XD}(C) / \log \mathrm{XD}(C)\bigr) \cdot \lfloor \log |C| \rfloor. \tag{28}$$

The key difference in the bounds for MQ-algorithms and MQ&EQ-algorithms 
is that with both MQ’s and EQ’s, we do not need to replace an XEQ with a 
concept c G C, so only the specifying sets for concepts not in C matter, whereas 
with only MQ’s we may need to simulate XEQ’s for concepts in C, so specifying 
sets for all concepts may matter. 

12 A Dimension for EQ-Algorithms?

Can we expect a similar characterization for learning a class C with proper 
equivalence queries only? The short answer is yes, but the story is a little more 
complicated. 

We’ll need samples as well as concepts. A sample s is a partial function from 
X to {0,1}. A sample may also be thought of as a subset of elements of X 
and their classifications, or a function from X to {0, 1, *}, with * standing for
“undefined.” If we identify a concept c with its characteristic function, mapping
X to {0, 1}, then a concept is a special case of a sample. Two samples are
consistent if they take the same values on the elements common to both of their 
domains. A sample s' extends a sample s if they are consistent and the domain 
of s is a subset of the domain of s' . 
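These definitions translate directly into code if a sample is represented as a dictionary from domain elements to 0 or 1, with undefined elements simply absent. The representation and the function names below are assumptions made for illustration.

```python
def consistent(s, t):
    """Two samples are consistent if they agree wherever both are defined."""
    return all(t[x] == label for x, label in s.items() if x in t)

def extends(s_prime, s):
    """s_prime extends s: they are consistent and dom(s) is a subset of dom(s_prime)."""
    return consistent(s, s_prime) and set(s) <= set(s_prime)

def as_sample(concept, domain):
    """View a concept as the total sample given by its characteristic function."""
    return {x: int(x in concept) for x in domain}

if __name__ == "__main__":
    s = {"x1": 1}
    t = {"x1": 1, "x2": 0}
    print(consistent(s, t), extends(t, s), extends(s, t))
    print(as_sample({"x1", "x3"}, ("x1", "x2", "x3")))
```

Under this representation a concept is the special case of a sample whose domain is all of X.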

It is interesting to note that the partial equivalence queries of Maass and 
Turán [16] can be characterized as equivalence queries with samples instead of
just concepts. 

12.1 The Fingerprint Dimension 

Early work on lower bounds for equivalence queries introduced the property of 
approximate fingerprints [2], which is sufficient to guarantee that a family of
classes of concepts cannot be learned with a polynomial number of EQ’s. This 
technique was applied to show that there is no polynomial-time EQ-algorithm 
for finite automata, DNF formulas, and many other classes of concepts. 

Gavaldà [9] proved that a suitable modification of the negation of the
approximate fingerprint property is both necessary and sufficient for learnability with
a polynomial number of proper equivalence queries. Hayashi et al. [11]
generalized the definitions to cover combinations of various types of queries. Stripped of
details not relevant to this development, the ideas may be formulated as follows.

If C is a concept class, c ∈ C, and d is a positive integer, then we define c to
be 1/d-good for C if for every x ∈ X, a fraction of at least 1/d of the concepts in
C agree with the classification of x by c. This idea generalizes the majority vote
concept for a class C, which is 1/2-good for C. If we make an EQ with a concept
c that is 1/d-good for C, then any counterexample must eliminate a fraction of
at least 1/d of the concepts in C.

Given a concept class C, we say that C' ⊆ C is reachable from C if there exists
a sample s such that C' consists of all those concepts in C that are consistent
with s. Not every subclass of a concept class is necessarily reachable.

Examples. For C = S+(X), the subclasses {{x}} are reachable (using the
sample s = {(x, 1)}), and subclasses consisting of S+(Y) for Y ⊆ X are reachable
(using a sample that maps the elements of X - Y to 0), but the subclass S(X),
consisting of the singletons, is not reachable.

Given a concept class C, the fingerprint dimension of C, denoted FD(C), is
the least positive integer d such that for every reachable subclass C' of C, there
is a concept c' ∈ C' that is 1/d-good for C'.

To see that FD(C) is well-defined, note that for any concept class C and any
concept c ∈ C, c is at least 1/|C|-good for C, because c at least agrees with
itself. A concept class C containing only one concept has FD(C) = 1, but any
concept class C containing at least two concepts has FD(C) ≥ 2.
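For a very small class, the fingerprint dimension can be computed by enumerating every sample and collecting the reachable subclasses. The sketch below is exhaustive (it visits 3^|X| samples) and rests on one assumption not spelled out in the text: empty subclasses, which arise from samples consistent with no concept, are skipped. The function name is hypothetical.

```python
from itertools import product

def fingerprint_dimension(concepts, domain):
    """Brute-force FD(C): least d with a 1/d-good concept in every reachable subclass."""
    domain = tuple(domain)
    concept_class = {frozenset(c) for c in concepts}

    # Enumerate samples: each element is labelled 0, 1, or None (undefined).
    reachable = set()
    for labels in product((0, 1, None), repeat=len(domain)):
        sub = frozenset(
            c for c in concept_class
            if all(lab is None or (x in c) == bool(lab)
                   for x, lab in zip(domain, labels)))
        if sub:  # assumption: only nonempty subclasses are considered
            reachable.add(sub)

    def is_good(c, sub, d):
        # c is 1/d-good for sub: on every x, at least |sub|/d concepts agree with c.
        return all(d * sum((x in other) == (x in c) for other in sub) >= len(sub)
                   for x in domain)

    d = 1
    while not all(any(is_good(c, sub, d) for c in sub) for sub in reachable):
        d += 1  # terminates because every concept is 1/|C|-good
    return d

if __name__ == "__main__":
    C0 = [{"x1", "x3"}, {"x3"}, {"x1", "x2"}, {"x1"}]
    print(fingerprint_dimension(C0, ("x1", "x2", "x3")))
```

The fraction test is written with integer arithmetic to avoid rounding issues when comparing against 1/d.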

We now show that the fingerprint dimension gives bounds on the number
of EQ’s necessary to learn a class of concepts. For any class C of concepts,

$$\mathrm{FD}(C) - 1 \le \#\mathrm{EQ}(C) \le \lceil \mathrm{FD}(C) \ln |C| \rceil. \tag{29}$$

If C has only one concept, then 0 = FD(C) - 1 = #EQ(C), so both
inequalities hold in this case. Assume C has at least two concepts, and let d = FD(C).
Clearly d ≥ 2.

We describe a learning algorithm to achieve the upper bound. At any point,
there is a class C' reachable from C that is consistent with the answers to all
the queries made so far. If C' contains one element, then the algorithm halts.
Otherwise, by the definition of FD(C) there is a concept c' ∈ C' that is 1/d-good
for C', and the algorithm makes an EQ with this concept c'.

Either the answer is “yes,” or a counterexample x eliminates a fraction of
at least 1/d of the concepts in C'. This continues until exactly one concept
c ∈ C is consistent with all the answers to queries. Then i queries are sufficient
if (1 - 1/d)^i |C| < 1. Hence, ⌈d ln |C|⌉ EQ’s suffice.

For the lower bound, because d is a minimum, there is a reachable subclass
C' of C that has no 1/(d - 1)-good concept. For this to be true, |C'| ≥ d. Thus,
for each concept c' ∈ C', there exists an element x ∈ X such that the fraction
of concepts in C' that agree with the classification of x by c' is smaller than
1/(d - 1). (This x could be termed a 1/(d - 1)-approximate fingerprint for c'
with respect to C'.)

Let s be the sample that witnesses the reachability of C' from C. That is,
C' consists of those elements of C that are consistent with s. We describe an
adversary to answer EQ’s for C that maintains a fraction of at least
(d - i - 1)/(d - 1) of the concepts in C' consistent with the answers to the first i EQ’s.

This is clearly true when i = 0. For an EQ with c ∈ C, if c ∉ C', then c must
not be consistent with s, and the adversary returns as a counterexample any
element x such that s and c classify x differently. If c ∈ C', then by our choice
of C', there is an element x such that the fraction of elements of C' that classify
x the same way as c is smaller than 1/(d - 1). The adversary returns any such x
as a counterexample. Queries of the first type do not eliminate any elements of
C', and queries of the second type eliminate fewer than (1/(d - 1))|C'| elements
of C', so after d - 2 EQ’s, there are at least

$$|C'|/(d - 1) > 1$$

concepts in C' consistent with all the answers the adversary has given. Hence,
any EQ-algorithm must use at least d - 1 EQ’s, establishing the lower bound.



12.2 The Sample Exclusion Dimension 

Balcazar et al. introduce the strong consistency dimension [^, which also yields 
bounds on the number of EQ’s to learn a concept class. We give a slight variant 
of that definition, which generalizes the exclusion dimension from concepts to 
samples. 

Let C be a concept class and s a sample. A specifying set for s with respect 
to C is a set S contained in the domain of s such that at most one concept c ∈ C 
is consistent with the sample s' obtained by restricting s to the elements of S. 
Note that this coincides with our previous definition of a specifying set if s is 
itself a concept. 

Define the sample exclusion dimension of a class C of concepts, denoted 
SXD(C), to be the maximum over all samples s such that s is not consistent 
with any c ∈ C, of the minimum size of any specifying set for s. This generalizes 
the exclusion dimension from concepts not in C to samples not consistent with 
any concept in C. For C = 2^X we stipulate that SXD(C) = 0. 
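For very small domains the definition can be checked by exhaustive search. The following Python sketch is only an illustration of the definition, not a practical procedure: it enumerates every sample (partial labelling) of the domain, so its cost is exponential. Concepts are represented as frozensets and samples as dictionaries; the function names and the helper `small_subsets` are assumptions introduced here.

```python
from itertools import combinations, product


def consistent(concept, sample):
    """A concept (frozenset) is consistent with a sample (dict x -> 0/1)."""
    return all((x in concept) == bool(label) for x, label in sample.items())


def min_specifying_set_size(cls, sample):
    """Size of a smallest S within dom(sample) leaving at most one consistent concept."""
    points = list(sample)
    for size in range(len(points) + 1):
        for subset in combinations(points, size):
            restricted = {x: sample[x] for x in subset}
            if sum(consistent(c, restricted) for c in cls) <= 1:
                return size
    return None  # not reached for samples inconsistent with every concept


def sxd(cls, domain):
    """Sample exclusion dimension by brute force (tiny domains only)."""
    best = 0  # also covers the stipulated case C = 2^X
    for labels in product((0, 1, None), repeat=len(domain)):
        sample = {x: b for x, b in zip(domain, labels) if b is not None}
        if any(consistent(c, sample) for c in cls):
            continue  # only samples inconsistent with every concept count
        best = max(best, min_specifying_set_size(cls, sample))
    return best


def small_subsets(n, k):
    """All subsets of {0, ..., n-1} of cardinality at most k."""
    return [frozenset(s) for r in range(k + 1) for s in combinations(range(n), r)]


sxd_k1 = sxd(small_subsets(3, 1), range(3))
sxd_k2 = sxd(small_subsets(5, 2), range(5))
```

The two example classes are the subsets of cardinality at most k of a domain with 2k + 1 elements, for k = 1 and k = 2.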

Because the maximization is over samples and not just concepts, for any class 
of concepts C, 

XD(C) ≤ SXD(C). (30) 

This differs from the strong consistency dimension introduced by Balcazar et 
al. [B] by at most 1, and coincides, in the case of equivalence queries, with the 
abstract identification dimension, also introduced by Balcazar et al. [3]. 

Examples. To get a sense of the difference between the exclusion dimension 
and the sample exclusion dimension, consider the concept class C_1, presented in 
Figure 7. This is a version of addressing, described by Maass and Turan |TSj. 

The empty set is not an element of C_1, but it has a specifying set {x_1, x_2}, 
because only c_1 also does not include either x_1 or x_2. However, the sample 

s = {(y_1, 0), (y_2, 0), (y_3, 0), (y_4, 0)}, 

which is not defined for x_1 and x_2, is not consistent with any element of C_1, 
but its smallest specifying sets have 3 elements, for example, {y_1, y_2, y_3}. 
Generalizing this example to 2^n concepts with n address bits gives an exponential 
disparity between the exclusion dimension and the sample exclusion dimension. 

Fig. 7. Concept class C_1, a version of addressing 

The sample exclusion dimension is a lower bound on the number of EQ’s 
needed to learn a concept class C. For any concept class C, 

SXD(C) ≤ #EQ(C). (31) 

If C = 2^X, then SXD(C) = 0 and the bound holds, so assume C ≠ 2^X. We 
describe an adversary to enforce at least d = SXD(C) EQ’s. Let s be a sample 
that is not consistent with any c ∈ C such that the smallest specifying 
set for s with respect to C has size d. Any EQ with a concept c ∈ C can be 
answered with an element x in the domain of s, because s is not consistent with 
any c ∈ C. Up to (d - 1) EQ’s can be answered thus, and there will still be at 
least two concepts in C consistent with all the answers given, so any successful 
learning algorithm must make at least one more EQ. 

Combining (29) and (31), we have 

SXD(C) ≤ #EQ(C) ≤ ⌈FD(C) ln |C|⌉. (32) 

The sample exclusion dimension also gives an upper bound on the fingerprint 
dimension. 



FD(C) ≤ SXD(C) + 1. (33) 

If C contains only one concept, then FD(C) = 1 and SXD(C) = 0, and the 
bound holds. Assume that C contains at least two concepts, and let d = SXD(C). 
Clearly d ≥ 1. Consider any subclass C' reachable from C, and let s be the 
sample that witnesses the reachability of C'. That is, C' is the set of concepts in 
C consistent with s. We show that C' contains a concept c' that is 1/(d + 1)-good 
for C'. 

Define another sample s' as follows. Let s'(x) = 1 if a fraction of more than 
d/(d + 1) concepts in C' contain x, and let s'(x) = 0 if a fraction of more than 
d/(d + 1) concepts in C' do not contain x. Note that s' is not defined for elements 
x for which the majority vote of C' does not exceed a fraction d/(d + 1) of the 
total number of elements of C'. Note that s' extends s because all of the elements 
of C' agree on elements in the domain of s. 

We claim that s' is consistent with some element of C. If not, then by the 
definition of SXD(C), there exists a specifying set S for s' with respect to C 
that contains at most d elements. Consider the set of elements of C' that are 
consistent with s' for all the elements of S. Agreement with s' on each element of 
S eliminates a fraction of less than 1/(d + 1) of the elements of C', so agreement 
on all the elements of S eliminates a fraction smaller than d/(d + 1) of the 
elements of C'. Thus, at least one element of C' is consistent with s' on all the 
elements of S, contradicting the assumption that s' is not consistent with any 
element in C. 

Thus, there is some element c ∈ C consistent with s', and since s' extends s, 
c ∈ C'. Thus, the concept c is a 1/(d + 1)-good element of C'. Because C' was an 
arbitrary reachable subclass of C, we have that FD(C) ≤ (d + 1), establishing 
the bound. 

As a corollary of (33) and the upper bound in (32), we have 

#EQ(C) ≤ ⌈(SXD(C) + 1) ln |C|⌉. (34) 



12.3 Inequivalence of FD(C) and SXD(C) 

Despite their similar properties in bounding #EQ(C), the two dimensions FD(C) 
and SXD(C) are different for some concept classes. 

Let X_{2k+1} = {x_1, x_2, ..., x_{2k+1}} and let C_k consist of all subsets of X_{2k+1} of 
cardinality at most k. Then |C_k| = 2^{2k} and ln |C_k| = O(k). 

We have SXD(C_k) = k because the only samples inconsistent with every 
concept in C_k must take on the value 1 for at least k + 1 domain elements, and 
a minimum specifying set will contain k domain elements with the value 1. On 
the other hand, FD(C_k) = 2, because every reachable subclass of C_k contains its 
majority vote concept. Of course, #EQ(C_k) = k, by a strategy that begins by 
conjecturing the empty set, and adds positive counterexamples to the conjecture 
until it is answered “yes.” 
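The strategy just described is easy to simulate. The sketch below is a hedged illustration: the function name and the choice of the smallest missing element as the counterexample are assumptions, and the learner stops without a final query once it has collected k elements, since the target is then determined.

```python
from itertools import combinations


def learn_small_subset(target, k):
    """Learn a target of cardinality at most k; return (hypothesis, #EQs)."""
    target = set(target)
    hypothesis, queries = set(), 0
    # Once k elements are known, the target is determined without another EQ.
    while len(hypothesis) < k:
        queries += 1
        if hypothesis == target:  # answered "yes"
            break
        # The hypothesis is always a subset of the target, so every
        # counterexample is positive.
        hypothesis.add(min(target - hypothesis))
    return frozenset(hypothesis), queries


k = 3
targets = [frozenset(s) for r in range(k + 1)
           for s in combinations(range(2 * k + 1), r)]
results = [learn_small_subset(t, k) for t in targets]
max_queries = max(q for _, q in results)
```

Over all targets in this class the simulated learner never needs more than k queries, and targets of cardinality k - 1 or k need exactly k.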

Thus, for the family of classes C_k, the sample exclusion dimension gives 
a tight lower bound, k, and a loose upper bound, O(k^2), while the fingerprint 
dimension gives a loose lower bound, 1, and an asymptotically tight upper bound, 
O(k), on the number of EQ’s required for learning. This is asymptotically as large 
as the discrepancy can be, as witnessed by (32), which is the combination that 
gives the strongest bounds on #EQ(C) at present. 

13 What about the VC-Dimension? 

Because the Vapnik-Chervonenkis dimension is so useful in PAC learning, it is 
natural to ask what its relationship is to learning with queries. A set S ⊆ X 
is shattered by a concept class C if all 2^{|S|} possible labellings of elements in S 
are achieved by concepts from C. The VC-dimension of a class C of concepts, 
denoted VCD(C), is the maximum cardinality of any set shattered by C. It is 
clear that for any concept class C, 

VCD(C) ≤ log |C|. (35) 
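The definitions of shattering and of VCD(C) translate directly into a brute-force computation for small finite classes. The Python sketch below is an illustration under the assumption that concepts are given as frozensets over a small finite domain; the function names are introduced here and the search is exponential in the domain size.

```python
from itertools import combinations
from math import log2


def is_shattered(points, cls):
    """True if the concepts in cls realize all 2^|points| labellings of points."""
    patterns = {tuple(x in c for x in points) for c in cls}
    return len(patterns) == 2 ** len(points)


def vc_dimension(cls, domain):
    """Largest cardinality of a subset of domain shattered by cls."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(s, cls) for s in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best


# Example class (an assumption): subsets of {0, 1, 2} of cardinality at most 1.
cls = [frozenset(s) for r in range(2) for s in combinations(range(3), r)]
vcd = vc_dimension(cls, [0, 1, 2])
assert vcd <= log2(len(cls))  # inequality (35) with logarithm base 2
```

The final assertion checks inequality (35) on the example: a shattered set of size m needs 2^m distinct concepts, so m cannot exceed log |C|.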

This and © imply 

VCD(C) < #MQ(C). (36) 

As Littlestone m observed, an adversary giving counterexamples from a shat- 
tered set can enforce VCD(C) XEQ’s, and therefore 

VCD(C) ≤ #XEQ(C) ≤ #EQ(C). (37) 

Maass and Turan [1^ show that for any concept class C, 

(1/7) VCD(C) ≤ #MQ&EQ(C). (38) 

They give an example of a family of concept classes that shows that the constant 
1/7 cannot be improved to be larger than 0.41, and also show that 

ivCD(C7) < #MQ&XEQ(C). (39) 

14 More General Dimensions 

Balcazar et al. present generalizations of the dimensions XTD(C), XD(C) and 
SXD(C) to arbitrary kinds of example-based queries [4], and beyond [5]. It is 
outside the scope of this sketch to treat their results fully, but we briefly describe 
the settings. For convenience we identify a concept c with its characteristic 
function, and write c(x) = 1 if x ∈ c. 

In [4], for an example-based query with a target concept c, the possible 
replies are identified with samples consistent with c, that is, with subfunctions 
of c. Thus, for a membership query about x, the reply is the singleton sample 
{(x, c(x))}. For an equivalence query with the concept c', the possible replies 
are either a counterexample x, which is represented by the sample {(x, c(x))}, or 
“yes,” which is represented by the sample equal to c, completely specifying it. For 
a subset query with c', the possible replies are either a counterexample, which 
is a singleton sample {(x, 0)} such that c'(x) = 1 and c(x) = 0, or “yes,” which 
is represented by the sample consisting of all pairs (x, 1) such that c'(x) = 1. 
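The encoding of replies as samples can be made concrete with a short Python sketch. It follows the description above for membership, equivalence, and subset queries; the representation (concepts as frozensets, samples as dictionaries from domain elements to 0/1) and all function names are assumptions chosen for illustration.

```python
def as_sample(concept, points):
    """The sample that labels each of the given points according to concept."""
    return {x: int(x in concept) for x in points}


def membership_reply(target, x):
    """Reply to a membership query about x: the singleton sample {(x, c(x))}."""
    return as_sample(target, [x])


def equivalence_replies(target, hypothesis, domain):
    """All possible replies to an equivalence query with hypothesis."""
    if hypothesis == target:
        return [as_sample(target, domain)]  # "yes": the sample equal to c
    return [as_sample(target, [x]) for x in sorted(hypothesis ^ target)]


def subset_replies(target, hypothesis):
    """All possible replies to a subset query with hypothesis."""
    if hypothesis <= target:
        return [{x: 1 for x in hypothesis}]  # "yes": all pairs (x, 1) with c'(x) = 1
    return [{x: 0} for x in sorted(hypothesis - target)]


def is_subfunction(sample, concept):
    """True if the sample agrees with the concept wherever it is defined."""
    return all((x in concept) == bool(label) for x, label in sample.items())


domain = [0, 1, 2, 3]
target = frozenset({1, 2})
hypothesis = frozenset({2, 3})
eq = equivalence_replies(target, hypothesis, domain)
sq = subset_replies(target, hypothesis)
```

In this example every reply is a subfunction of the target, as the setting requires.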

A protocol is a ternary relation on queries, target concepts, and possible 
answers. Two conditions are imposed on the relation. One is completeness, which 
requires that for every possible query and target concept, there is at least one 
possible answer. The other is fair play, which requires that if an answer a is possible 
for a query q and a target concept c, then for any other target concept c' such 
that the answer a is a subfunction of c', a is a possible answer for q with target 
concept c'. The fair play condition ensures that an answer cannot “rule out” 
a candidate hypothesis unless it is inconsistent with it. For this setting, a very 
general dimension, the abstract identification dimension, is defined and shown 
to generalize the extended teaching dimension, the exclusion dimension, and the 
sample exclusion dimension. 

In [5], Balcazar et al. define an even more general setting, covering many 
kinds of non-example-based queries. In this setting, the answer to a query is 
identified with a property that is true of the target concept, or equivalently, a 
subset of concepts that includes the target concept, or a Boolean function on 
all possible concepts that is true for the target concept. For example, if the 
target concept is c, a restricted equivalence query with the concept c' returns 
only the answers “yes” (if c' = c) and “no” (if c' ≠ c), with no counterexample. 
The reply “yes” can be formalized as the singleton {c}, specifying c completely, 
while the reply “no” can be formalized as the set 2^X - {c'}, which gives only the 
information that c ≠ c'. In this setting, the authors define the general dimension 
for a target class and learning protocol and prove that the optimal number of 
queries for the class and the protocol is bounded between this dimension and 
this dimension times ⌈ln |C|⌉. 



15 Remarks 

The approach of bounding the number of queries required to learn concepts from 
a class C using combinatorial properties of C has made great progress. This 
sketch has omitted very many things, including the fascinating applications of 
these results to specific concept classes. One major open problem is whether 
DNF formulas can be learned using a polynomial number of MQ’s and EQ’s. 
The reader is strongly encouraged to consult the original works. 

Abstract. In this paper we claim that meaningful representations can 
be learned by programs, although today they are almost always designed 
by skilled engineers. We discuss several kinds of meaning that repre- 
sentations might have, and focus on a functional notion of meaning as 
appropriate for programs to learn. Specifically, a representation is mean- 
ingful if it incorporates an indicator of external conditions and if the 
indicator relation informs action. We survey methods for inducing kinds 
of representations we call structural abstractions. Prototypes of sensory 
time series are one kind of structural abstraction, and though they are 
not denoting or compositional, they do support planning. Deictic rep- 
resentations of objects and prototype representations of words enable a 
program to learn the denotational meanings of words. Finally, we discuss 
two algorithms designed to find the macroscopic structure of episodes in 
a domain-independent way. 



1 Introduction 

In artificial intelligence and other cognitive sciences it is taken for granted that 
mental states are representational. Researchers differ on whether representations 
must be symbolic, but most agree that mental states have content — they are 
about something and they mean something — irrespective of their form. Re- 
searchers differ too on whether the meanings of mental states have any causal 
relationship to how and what we think, but most agree that these meanings 
are (mostly) known to us as we think. Of formal representations in computer 
programs, however, we would say something different: Generally, the meanings 
of representations have no influence on the operations performed on them (e.g., 
a program concludes q because it knows p → q and p, irrespective of what p 
and q are about); yet the representations have meanings, known to us, the 
designers and end-users of the programs, and the representations are provided to 
the programs because of what they mean (e.g., if it was not relevant that the 
patient has a fever, then the proposition febrile(patient) would not be 
provided to the program; programs are designed to operate in domains where 
meaning matters). Thus, irrespective of whether the contents of mental states 
have any causal influence on what and how we think, these contents clearly are 
intended (by us) to influence what and how our programs think. The meanings 
of representations are not irrelevant but we have to provide them. 

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 32-|5^ 2001. 

If programs could learn the meanings of representations it would save us a 
great deal of effort. Most of the intellectual work in AI is done not by programs 
but by their creators, and virtually all the work involved in specifying the mean- 
ings of representations is done by people, not programs (but see, e.g., |27I221 
m)- This paper discusses kinds of meaning that programs might learn and gives 
examples of such programs. 

How do people and computers come to have contentful, i.e., meaningful, men- 
tal states? As Dennett [10] points out, there are only three serious answers to the 
question: Contents are learned, told, or innate. Lines cannot be drawn sharply 
between these, in either human or artificial intelligence. Culture, including our 
educational systems, blurs the distinction between learning and being told; and 
it is impossible methodologically to be sure that the meanings of mental states 
are innate, especially as some learning occurs in utero [S] and many studies of 
infant knowledge happen weeks or months after birth. 

One might think the distinctions between learning, being told, and innate 
knowledge are clearer in artificial systems, but the role of engineers is rarely 
acknowledged 18IMT^ . Most AI systems manipulate representations that mean 
what engineers intend them to mean; the meanings of representations are exoge- 
nous to the systems. It is less clear where the meanings of learned representations 
reside, in the minds of engineers or the “minds of the machines” that run the 
learning algorithms. We would not say that a linear regression algorithm knows 
the meanings of data or of induced regression lines. Meanings are assigned by 
data analysts or their client domain experts. Moreover, these people select data 
for the algorithms with some prior idea of what they mean. Most work in 
machine learning, KDD, and AI and statistics is essentially data analysis, with 
humans, not machines, assigning meanings to regularities found in the data. 

We have nothing against data analysis; indeed, we think that learning the 
meanings of representations is data analysis, in particular, analysis of sensory 
and perceptual time series. Our goal, though, is to have the machine do all of it: 
select data, process it, and interpret the results; then iterate to resolve 
ambiguities, test new hypotheses, refine estimates, and so on.¹ The relationship between 
domain experts, statistical consultants, and statistical algorithms is essentially 
identical to the relationship between domain experts, AI researchers, and their 
programs: In both cases the intermediary translates meaningful domain concepts 
into representations that programs manipulate, and translates the results back 
to the domain experts. We want to do away with the domain expert and the 
engineers/statistical consultants, and have programs learn representations and their 
meanings, autonomously. 

¹ An early effort in our laboratory to automate applied statistics was Rob St. Amant’s 
dissertation [26,25]: while it succeeded in many respects, it had only weak notions 
of the meaning of representations, so its automated data analysis had a formal, 
syntactic feel (e.g., exploring high leverage points in regression analysis). 

One impediment to learning the meanings of representations is the fuzziness 
of commonsense notions of meaning. Suppose a regression algorithm induces a 
strong relationship between two random variables x and y and represents it in 
the conventional way: y = 1.31x - .03, R^2 = .86, F = 108.3, p < .0001. One 
meaning of this representation is provided by classical inferential statistics: x and 
y appear linearly related and the relationship between these random variables 
is very unlikely to be accidental. Note that this meaning is not accessible in any 
sense to the regression algorithm, though it could be known, as it is conventional 
and unambiguous.² Now, the statistician might know that x is daily temperature 
and y is ice-cream sales, and so he or his client domain expert might assign 
additional meaning to the representation, above. For instance, the statistician 
might warn the domain expert that the assumptions of linear regression are not 
well-satisfied by the data. Ignoring these and other cautions, the domain expert 
might even interpret the representation in causal terms (i.e., hot weather causes 
people to buy ice-cream). Should he submit the result to an academic journal, 
the reviews would probably criticize this semantic liberty and would in any case 
declare the result as meaningless in the sense of being utterly unimportant and 
unsurprising. 

This little example illustrates at least five kinds of meaning for the representation 
y = 1.31x - .03, R^2 = .86, F = 108.3, p < .0001. There is the formal 
meaning, including the mathematical fact that 86 % of the variance in the ran- 
dom variable y is explained by x. Note that this meaning has nothing to do with 
the denotations of y and x, and it might be utter nonsense in the domain of 
weather and ice-cream, but, of course, the formal meaning of the representation 
is not about weather and ice cream, it is about random variables. Another kind 
of meaning has to do with the model that makes y and x denote ice cream and 
weather. When the statistician warns that the residuals of the regression have 
structure, he is saying that a linear model might not summarize the relationship 
between x and y as well as another kind of model. He makes no reference to the 
denotations of x and y, but he might, as when he warns that ice cream sales are 
not normally distributed. In both cases, the statistician is questioning whether 
the domain (ice cream sales and weather) is faithfully represented by the model 
(a linear function between random variables). He is questioning whether the “es- 
sential features” of the domain are represented, and whether they are somehow 
distorted by the regression procedure. 

The domain expert will introduce a third kind of meaning: he will interpret 
y = 1.31x - .03, R^2 = .86, F = 108.3, p < .0001 as a statement about ice cream 
sales. This is not to say that every aspect of the representation has an 
interpretation in the domain (the expert might not assign a meaning to the coefficient 
-.03), only that, to the expert, the representation is not a formal object but 
a statement about his domain. We could call this kind of meaning the domain 
semantics, or the functional semantics, to emphasize that the interpretation of a 
representation has some effect on what the domain expert does or thinks about. 

² Indeed, there is a sense in which stepwise regression algorithms know what F 
statistics mean, as the values of these statistics directly affect the behavior of the 
algorithms. 

Having found a relationship between ice cream sales and the weather, the 
expert will feel elated, ambitious, or greedy, and this is a fourth, affective kind of 
meaning. Let us suppose, however, that the relationship is not real, it is entirely 
spurious (an artifact of a poor sampling procedure, say) and is contradicted by 
solid results in the literature. In this case the representation is meaningless in 
the sense that it does not inform anyone about how the world really works. 

To which of these notions of meaning should a program that learns meanings 
be held responsible? The semantics of classical statistics and regression analysis 
in particular are sophisticated, and many humans perform adequate analyses 
without really understanding either. More to the point, what good is an agent 
that learns formal semantics in lieu of domain or functional semantics? The rela- 
tionship between x and y can be learned (even without a statistician specifying 
the form of the relationship), but so long as it is a formal relationship between 
random variables, and the denotations of x and y are unknown to the learner, a 
more knowledgeable agent will be required to translate the formal relationship 
into a domain or functional one. The denotations of x and y might be learned, 
though generally one needs some knowledge to bootstrap the process; for ex- 
ample, when we say, “x denotes daily temperature,” we call up considerable 
amounts of common-sense knowledge to assign this statement meaning. As to 
affective meanings, we believe artificial agents will benefit from them, but we do 
not know how to provide them. 

This leaves two notions of meaning, one based in the functional roles of rep- 
resentations, the other related to the informativeness of representations. The 
philosopher Fred Dretske wrestled these notions of meaning into a theory of 
how meaning can have causal effects on behavior PE]. Dretske’s criteria for 
a state being a meaningful representational state are: the state must indicate 
some condition, have the function of indicating that condition, and have this 
function assigned as the result of a learning process. The latter condition is 
contentious but it will not concern us here as this paper is about learning 
meaningful representations. The other conditions say that a reliable indicator 
relationship must exist and be exploited by an agent for some purpose. Thus, the 
relationship between mean daily temperature (the indicator) and ice-cream sales 
(the indicated) is apt to be meaningful to ice-cream companies, just as the rela- 
tionship between sonar readings and imminent collisions is meaningful to mobile 
robots, because in each case an agent can do something with the relationship. 
Learning meaningful representations, then, is tantamount to learning reliable re- 
lationships between denoting tokens (e.g., random variables) and learning what 
to do when the tokens take on particular values. 

³ Projects such as Cyc emphasize the denotational meanings of representations 
[17,16]. Terms in Cyc are associated with axioms that say what the terms mean. It 
took a colossal effort to get enough terms and axioms into Cyc to support the easy 
acquisition of new terms and axioms. 

The minimum required of a representation by Dretske’s theory is an indicator 
relationship s ← I(S) between the external world state S and an internal state 
s, and a function that exploits the indicator relationship through some kind of 
action a, presumably changing the world state: f(s, a) → S. The problems are to 
learn representations s ~ S and the functions f (the relationship ~ is discussed 
below, but here means “abstraction”). 

These are familiar problems to researchers in the reinforcement learning com- 
munity, and we think reinforcement learning is a way to learn meaningful rep- 
resentations (with the reservations we discuss in fSO]). We want to up the ante, 
however, in two ways. First, the world is a dynamic place and we think it is 
necessary and advantageous for s to represent how the world changes. Indeed, 
most of our work is concerned with learning representations of dynamics. 

Second, a policy of the form f(s, a) → S manifests an intimate relationship 
between representations s and the actions a conditioned on them: s contains 
the “right” information to condition a. The right information is almost always 
an abstraction of raw state information; indeed, two kinds of abstraction are 
immediately apparent. Not all state information is causally relevant to action, 
so one kind of abstraction involves selecting information in S to include in s (e.g., 
subsets or weighted combinations or projections of the information in S). The 
other kind of abstraction involves the structure of states. Consider the sequence 
AABACAABACAABACAABADAABAC. Its structure can be described many ways, 
perhaps most simply by saying, “the sequence AABAx repeats five times, and 
x = C in all but the fourth replication, when x = D.” This might be the abstraction 
an agent needs to act; for example, it might condition action on the distinction 
between AABAC and AABAD, in which case the “right” representation of the 
sequence above is something like this: p_1 s_1 p_1 s_1 p_1 s_1 p_1 s_2 p_1 s_1, where p and s denote 
structural features of the original sequence, such as “prefix” and “suffix.” We call 
representations that include such structural features structural abstractions. 
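To make the prefix/suffix representation concrete, here is a minimal Python sketch that rewrites the example sequence in that form. It does not learn anything: the block length (5) and the prefix length (4) are supplied by hand, which is exactly the knowledge a learning program would have to discover. The function name and its parameters are assumptions made for illustration.

```python
def structural_abstraction(sequence, block, prefix_len):
    """Rewrite a sequence as numbered prefix/suffix tokens, one pair per block."""
    prefixes, suffixes, tokens = {}, {}, []
    for start in range(0, len(sequence), block):
        chunk = sequence[start:start + block]
        prefix, suffix = chunk[:prefix_len], chunk[prefix_len:]
        # Number each distinct prefix and suffix in order of first appearance.
        p = prefixes.setdefault(prefix, len(prefixes) + 1)
        s = suffixes.setdefault(suffix, len(suffixes) + 1)
        tokens.append(f"p{p} s{s}")
    return " ".join(tokens)


abstraction = structural_abstraction("AABACAABACAABACAABADAABAC", block=5, prefix_len=4)
print(abstraction)  # p1 s1 p1 s1 p1 s1 p1 s2 p1 s1
```

The only block that differs is the fourth, whose suffix D receives the new index s2, which is the distinction an agent could condition its action on.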

To recap, representations s are meaningful if they are related to action by 
a function f(s, a) → S, but f can be stated more or less simply depending on 
the abstraction s ~ S. One kind of abstraction involves selecting from the 
information in S, the other is structural abstraction. The remainder of this paper is 
concerned with learning structural abstractions.⁴ Note that, in Dretske’s terms, 
structural abstractions can be indicator functions but not all indicator functions 
are structural abstractions. Because the world is dynamic, we are particularly 
concerned with learning structural abstractions of time series. 

2 Approach 

In the spirit of dividing problems to conquer them we can view the problem of 
learning meaningful representations as having two parts: 

1. Learn representations s ~ S 

2. Learn functions f(s, a) → S 

⁴ Readers familiar with Artificial Intelligence will recognize the problem of learning 
structural abstractions as what we call “getting the representation right,” a creative 
process that we reserve unto ourselves and to which, if we are honest, we must 
attribute most of the performance of our programs. 

This strategy is almost certainly wrong because, as we said, s is supposed to 
inform actions a, so should be learned while trying to act. Researchers do not 
always follow the best strategy (we don’t, anyway) and the divide and conquer 
strategy we did follow has produced an unexpected dividend: We now think it 
is possible that some kinds of structural abstraction are generally useful, that is, 
they inform large classes of actions. (Had we developed algorithms to learn s to 
inform particular actions a we might have missed this possibility.) Also, some 
kinds of actions, particularly those involving deictic acts like pointing or saying 
the name of a referent, provide few constraints on representations. 

Our approach, then, is to learn structural abstractions first and functions 
relating representations and actions to future representations, second. 



3 Structural Abstractions of Time Series and Sequences 

As a robot wanders around its environment, it generates a sequence of values of 
state variables. At each instant t we get a vector of values X_t (our robot samples 
its sensors at 10 Hz, so we get ten such vectors each second). Suppose we have 
a long sequence of such vectors X = X_0, X_1, .... Within X are subsequences X_ij 
that, when subjected to processes of structural abstraction, give rise to episode 
structures that are meaningful in the sense of informing action.⁵ The trick is 
to find the subsequences X_ij and design the abstraction processes that produce 
episode structures. We have developed numerous methods of this sort and survey 
them briefly, here. 

3.1 Structural Abstraction for Continuous Multivariate Series 

State variables such as translational velocity and sonar readings take continuous 
values, in contrast to categorical variables such as “object present in the visual 
field.” Figure 1 shows four seconds of data from a Pioneer 1 robot as it moves 
past an object. Prior to moving, the robot establishes a coordinate frame with 
an x axis perpendicular to its heading and a y axis parallel to its heading. As 
it begins to move, the robot measures its location in this coordinate frame. 
Note that the ROBOT-X line is almost constant. This means that the robot did 
not change its heading as it moved. In contrast, the ROBOT-Y line increases, 
indicating that the robot does increase its distance along a line parallel to its 
original heading. Note especially the VIS-A-X and VIS-A-Y lines, which represent 
the horizontal and vertical locations, respectively, of the centroid of a patch of 
light on the robot’s “retina,” a CCD camera. VIS-A-X decreases, meaning that 
the object drifts to the left on the retina, while VIS-A-Y increases, meaning the 
object moves toward the top of the retina. Simultaneously, both series jump to 
constant values. These values are returned by the vision system when nothing is 
in the field of view. 

⁵ Ideally, structural abstraction should be an on-line process that influences action 
continuously. In practice, most of our algorithms gather data in batches, form 
abstractions, and use these to inform action in later episodes. 




Fig. 1. As the robot moves, an object approaches the periphery of its field of view then 
passes out of sight. 



Every time series that corresponds to moving past an object has qualitatively 
the same structure as the one in Figure 1. It follows that if we had a statistical 
technique to group the robot’s experiences by the characteristic patterns in 
multivariate time series (where the variables represent sensor readings), then 
this technique would in effect learn a taxonomy of the robot’s experiences. 
Clustering by dynamics is such a technique: 

1. A long multivariate time series is divided into segments, each of which
represents an episode such as moving toward an object, avoiding an object,
crashing into an object, and so on. (Humans divide the series into episodes
by hand; more on this in section 5.) The episodes are not labeled in any way.

2. A dynamic time warping algorithm compares every pair of episodes and
returns a number that represents the degree of similarity between the time
series in the pair. Dynamic time warping is a technique for “morphing” one
multivariate time series into another by stretching and compressing the
horizontal (temporal) axis of one series relative to the other [13]. The algorithm
returns a degree of mismatch (conversely, similarity) between the series after
the best fit between them has been found.

3. Given similarity numbers for every pair of episodes, it is straightforward to
cluster episodes by their similarity.

4. Another algorithm finds the “central member” of each cluster, which we call
the cluster prototype following Rosch [24].
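
Steps 2 and 4 above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `dtw_distance` and `prototype`, the Euclidean per-tick cost, and the reading of “central member” as the medoid of a cluster are all choices made for this example.

```python
import math

def dtw_distance(a, b):
    """Mismatch between two multivariate series (lists of equal-width tuples)
    after the best temporal alignment, found by dynamic programming."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # assumed per-tick cost
            # Stretch a, stretch b, or advance both series by one tick.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def prototype(cluster):
    """Assumed 'central member': the episode with the least total mismatch
    to the other episodes in its cluster (the medoid)."""
    return min(cluster, key=lambda e: sum(dtw_distance(e, o) for o in cluster))
```

Because warping stretches the temporal axis, a slow and a fast version of the same episode have zero mismatch in this sketch, which is the property that lets episodes of different durations fall into one cluster.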

Clustering by dynamics produces structural abstractions (prototypes) of time
series; the question is whether these abstractions can be meaningful in the sense
of informing action. In his PhD dissertation, Matt Schmill shows how to use
prototypes as planning operators. The first step is to learn rules of the form, “in
state i, action a leads to state j with probability p.” These rules are learned by a
classical decision-tree induction algorithm, where features of states are decision
variables. Given such rules, the robot can plan by means-ends analysis. It plans 
not to achieve world states specified by exogenous engineers, as in conventional 
generative planning, but to achieve world states which are preconditions for its 
actions. Schmill calls this “planning to act,” and it has the effect of gradually 
increasing the size of the corpus of prototypes and things the robot can do. 
The neat thing about this approach is that every action produces new data, 
i.e., revises the set of prototypes. It follows that the robot should plan more 
frequently and more reliably as it gains more experiences, and in recent exper- 
iments, Schmill demonstrates this. Thus, Schmill’s work shows that clustering 
by dynamics yields structural abstractions of time series that are meaningful in 
the sense of informing action. 

There is also a strong possibility that prototypes of this kind are meaningful 
in the sense of informing communicative actions. Oates, Schmill and Cohen [29] 
report a very high degree of concordance between the clusters of episodes gener- 
ated by the dynamic time warping method, above, and clusters generated by a 
human judge. The prototypes produced by dynamic time warping are not weird 
and unfamiliar to people, but seem to correspond to how humans themselves 
categorize episodes. Were this not the case, communication would be hard, be- 
cause the robot would have an ontology of episodes unfamiliar to people. Oates, 
in particular, has been concerned with communication and language, and has 
developed several methods for learning structural abstractions of time series that 
correspond to words in speech and denotations of words in time series of other 
sensors, as we shall see, shortly. 



3.2 Structural Abstraction for Categorical Sequences 

Ramoni, Sebastiani and we have developed Bayesian algorithms for clustering
activities by their dynamics [18]. In this work, dynamics are captured in
first-order markov chains, and so the method is best suited to clustering
sequences of discrete symbols. The Bayesian Clustering by Dynamics (BCD)
algorithm is easily sketched: Given time series of tokens that represent states,
construct a transition probability table (i.e., a markov chain model) for each
series, then measure the similarity between each pair of tables using the
Kullback-Leibler (KL) distance, and finally group similar tables into clusters.
The BCD algorithm is agglomerative, which means that initially there is one
cluster for each markov chain, then markov chains are merged, iteratively, until
a stopping criterion is met. Merging two markov chains yields another markov
chain. The stopping criterion in BCD is that the posterior probability of the
clustering is maximum. Said another way, BCD solves a Bayesian model selection
problem where the model it seeks is the most probable partition of the original
markov chains given the data and the priors (a partition is a division of a set
into mutually exclusive and exhaustive subsets). As the space of partitions is
exponential, BCD uses the KL distance as a heuristic to propose which markov
chains to merge, but only merges them if doing so improves the marginal
likelihood of the resulting partition. We have developed versions of BCD for
series of a single state variable and for series of vectors of state variables [mi].
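
The first two steps of this description (building a transition table per series and comparing tables with a KL distance) can be sketched as follows. This is only an illustration and omits the Bayesian merging step entirely. The function names, the pseudo-count used to avoid zero probabilities, and the choice to symmetrise the KL divergence and average it over rows are assumptions of the sketch, not details taken from the BCD papers.

```python
import math
from collections import Counter

def transition_table(series, states, pseudo=1.0):
    """Row-normalised first-order transition probabilities for one series.
    The pseudo-count is an assumption that keeps every probability nonzero."""
    counts = Counter(zip(series, series[1:]))
    table = {}
    for s in states:
        total = sum(counts[(s, t)] for t in states) + pseudo * len(states)
        table[s] = {t: (counts[(s, t)] + pseudo) / total for t in states}
    return table

def kl_distance(p, q):
    """Symmetrised KL divergence between two tables, averaged over rows."""
    def kl(a, b):
        return sum(a[t] * math.log(a[t] / b[t]) for t in a)
    return sum(kl(p[s], q[s]) + kl(q[s], p[s]) for s in p) / (2 * len(p))
```

With tables for the series `"ababababab"`, `"abababab"` and `"aaaabbbb"`, the first two are much closer to each other than either is to the third, so a heuristic of this kind would propose merging them first.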

The clusters found by BCD have never been used by the robot to inform 
its actions, as in Schmill’s experiments, so we cannot say they are meaningful 
to the robot. It is worth mentioning that they have high concordance with the 
clustering produced by dynamic time warping and with human clustering, when 
applied to the series in the Oates et al. experiment, cited above. 



3.3 A Critique of Sensory Prototypes 

Clustering by dynamics, whether by dynamic time warping or Bayesian model 
selection, takes a set of time series or sequences and returns a partition of the 
set. Prototypes, or “average members,” may then be extracted from the resulting 
subsets. While the general idea of clustering by dynamics is attractive (it does 
produce meaningful structural abstractions), the methods described above have 
two limitations. First, they require someone (or some algorithm) to divide a 
time series into shorter series that contain instances of the structures we want 
to find. For example, if we want to find classes of interactions with objects, we 
must provide a set of series each of which contains one interaction with objects. 
Neither technique can accept time series of numerous undifferentiated activities 
(e.g., produced by a robot roaming the lab for an hour). 

A more serious problem concerns the kinds of structural abstraction pro- 
duced by the methods. The dynamic time warping method produces “average 
episodes,” of which Figure 1 is an example, and BCD produces “average markov
chains,” which are just probability transition matrices. Suppose we examine an 
instance of each representation that corresponds to the robot rolling past a cup. 
Can we find anything in either representation that denotes the cup? We can- 
not. Consequently, these representations cannot inform actions that depend on 
individuating the cup; for example, the robot could not respond correctly to the 
directive, “Turn toward the cup.” The abstractions produced by the algorithms 
contain sufficient structure to cluster the episodes, but still lack much of the 
structure of the episodes. This is particularly true of the markov chain models, 
in which the series is chopped up into one-step transitions and all global infor- 
mation about the “shape” of the series is lost, but even the representation in 
Fig. 1 does not individuate the objects in an episode.

If one is comfortable with a crude distinction between sensations and
concepts, then the structural abstractions produced by the methods described above
are entirely sensory. They are abstractions of the dynamics of sensor values
(of how an episode “feels”); they do not represent concepts such as objects,
actors, actions, spatial relationships, and the like. Fig. 1 represents the sensations
of moving past an object, so it is meaningful if the robot conditions actions on its
sensations (as it does in Matt Schmill’s work) but it is not a representation of an
object, the act of moving, the distance to the object, or any other individuated
entity. Nor for that matter do these abstractions make explicit other structural
features of episodes, such as the boundaries between sub-episodes or cycles among
states.

Oates has developed methods for learning structural abstractions of time 
series that individuate words in speech and objects in a scene. Oates’ methods 
are described in the following section. We also have implemented algorithms for
finding the boundaries in episodes and the hierarchical structure of episodes;
these are described in section 5.

4 Learning Word Meanings 

Learning the meanings of words in speech clearly requires individuation of el- 
ements in episodes. Suppose we wanted to talk to the robot about cups: We 
would say, “there’s a cup” when we see it looking at a cup; or, “a cup is on your 
left,” when a cup is outside its field of view; or, “point to the cup,” when there 
are several objects in view, and so on. To learn the meaning of the word “cup” 
the robot must first individuate the word in the speech signal, then individuate
the object “cup” in other sensory series, associate the representations, and
perhaps estimate some properties of the object corresponding to the cup, such as
its color, or the fact that it participates as a target in a “turn toward” activity.
In his PhD dissertation, Oates discusses an algorithm called PERUSE that does
all these things.

To individuate objects in time series Oates relies on deictic markers:
functions that map from raw sensory data to representations of objects. A
simple deictic marker might construct a representation whenever the area of
a colored region of the visual field exceeds a threshold. The representation might
include attributes such as the color, shape, and area of the object, as well as
the intervals during which it is in view, and so on.

To individuate words in speech, Oates requires a corpus of speech in which
words occur multiple times (e.g., multiple sentences contain the word “cup”).
Spoken words produce similar (but certainly not identical) patterns in the speech
signal, as one can see in Figure 2. (In fact, Oates’ representation of the speech
signal is multivariate but the univariate series in Fig. 2 will serve to describe
his approach.) If one knew that a segment in Figure 2 corresponded to a word,
then one could find other segments like it, and construct a prototype or average
representation of these. For instance, if one knew that the segment labeled A
in Figure 2 corresponds to a word, then one could search for similar segments
in the other sentences, find A’, and construct a prototype from them. These
problems are by no means trivial, as the boundaries of words are not helpfully
marked in speech. Oates treats the boundaries as hidden variables and invokes
the Expectation Maximization algorithm to learn a model of each word that
optimizes the placement of the boundaries. However, it is still necessary to
begin with a segment that probably corresponds to a word. To solve this problem,
Oates relies on versions of the boundary entropy heuristic and frequency
heuristics, discussed below. In brief, the entropy of the distribution of the “next tick”
spikes at episode (e.g., word) boundaries; and the patterns in windows that
contain boundaries tend to be less frequent than patterns in windows that do not.
These heuristics, combined with some methods for growing hypothesized word
segments, suffice to bootstrap the process of individuating words in speech.




Fig. 2. Corresponding words in four sentences. Word boundaries are shown as boxes 
around segments of the speech signal. Segments that correspond to the same word are 
linked by connecting lines. 



Given prototypical representations of words in speech, and representations 
of objects and relations, Oates’ algorithm learns associatively the denotations 
of the words. Denotation is a common notion of meaning: The meaning of a 
symbol is what it points to, refers to, selects from a set, etc. However, naive 
implementations of denotation run into numerous difficulties, especially when 
one tries to learn denotations. One difficulty is that the denotations of many 
(perhaps most) words cannot be specified as boolean combinations of properties 
(this is sometimes called the problem of necessary and sufficient conditions). 
Consider the word “cup”. With repeated exposure, one might learn that the 
word denotes prismatic objects less than five inches tall. This is wrong because 
it is a bad description of cups, and it is more seriously wrong because no such 
description of cups can reliably divide the world into cups and non-cups (see,
e.g., [15]).

Another difficulty with naive notions of denotation is referential ambiguity. 
Does the word “cup” refer to an object, the shape of the object, its color, the 
actions one performs on it, the spatial relationship between it and another object, 
or some other feature of the episode in which the word is uttered? How can an 
algorithm learn the denotation of a word when so many denotations are logically 
possible? 

Let us illustrate Oates’ approach to these problems with the word “square,”
which has a relatively easy denotation. Suppose one’s representation of an object
includes its apparent height and width, and the ratio of these. An object will
appear square if the ratio is near 1.0. Said differently, the word “square” is more
likely to be uttered when the ratio is around 1.0 than otherwise. Let φ be the
group of sensors that measures height, width and their ratio, and let x be the
value of the ratio. Let U be an utterance and W be a word in the utterance.
Oates defines the denotation of W as follows:

$$\mathrm{denote}(W, \phi, x) = \Pr(\mathrm{contains}(U, W) \mid \mathrm{about}(U, \phi), x) \qquad (1)$$

The denotation of the word “square” is the probability that it is uttered given
that the utterance is about the ratio of height to width and the value of the ratio.
More plainly, when we say “square” we are talking about the ratio of height to
width and we are more likely to use the word when the value of the ratio is
close to 1.0. This formulation of denotation effectively dismisses the problem
of necessary and sufficient conditions, and it brings the problem of referential
ambiguity into sharp focus, for when an algorithm tries to learn denotations it
does not have access to the quantities on the right hand side of Eq. 1; it has
access only to the words it hears:

$$\mathrm{hear}(W, \phi, x) = \Pr(\mathrm{contains}(U, W) \mid x) \qquad (2)$$

The problem (for which Oates provides an algorithm) is to get denote(W, φ, x)
from hear(W, φ, x).
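
To make the observable quantity concrete, the following Python sketch estimates the probability that an utterance contains a word given the sensed value x, by binning x and counting. It illustrates only the “hear” quantity, not Oates’ algorithm for recovering the denotation from it. The function name `estimate_hear`, the data layout, and the bin width are hypothetical choices for this example.

```python
from collections import defaultdict

def estimate_hear(observations, word, bin_width=0.25):
    """observations: (utterance_words, x) pairs, where x is the sensed value
    (for instance the height-to-width ratio) when the utterance was heard.
    Returns {left edge of bin: fraction of utterances in the bin containing word}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for words, x in observations:
        b = round((x // bin_width) * bin_width, 10)  # left edge of the bin holding x
        totals[b] += 1
        hits[b] += word in words  # True counts as 1
    return {b: hits[b] / totals[b] for b in totals}
```

If “square” is uttered mostly when the ratio is near 1.0, the estimate for the bin starting at 1.0 will be higher than for bins far from it, even though some of those utterances are about color or position rather than shape.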

At this juncture, however, we have said enough to make the case that word
meanings can be learned from time series of sensor and speech data. We claim
that Oates’ PERUSE algorithm constructs representations and learns their
meanings by itself. Although the deictic representations of objects are not learned,
the representations of words are learned and so are the associations between
features of the deictic representations and words. PERUSE learns the word
“above,” for instance, and learns to associate it with a spatial relationship. At no
point does an engineer implant representations in the system and provide them
with interpretations. Although PERUSE is a suite of statistical methods, it is about
as far from the data analysis paradigm with which we began this paper as one can
imagine. In that example, an analyst and his client domain expert select and
provide data to a linear regression algorithm because it means something to
them, and the algorithm computes a regression model that (presumably) means
something to them. Neither data nor model mean anything to the algorithm.
In contrast, PERUSE selects and processes speech data in such a way that the
resulting prototypes are likely to be individuated entities (more on this, below),
and it assigns meaning to these entities by finding their denotations as described
earlier. Structural abstraction of representations and assignment of meaning are
all done by PERUSE.

The algorithm clearly learns meaningful representations, but are they
meaningful in the sense of informing action? As it happens, PERUSE builds word
representations sufficient for a robot to respond to spoken commands and to
translate words between English, German and Mandarin Chinese. The
denotational meanings of the representations are therefore sufficient to inform some
communicative acts.

5 Elucidating Episode Structure

Earlier we described our problem as learning structural abstractions of state 
information, particularly abstractions of the dynamics of state variables, which 
inform action. Section 3 introduced clustering by dynamics and sensory
abstractions which did not individuate objects or the structure of a robot’s activities.
The previous section showed how to individuate words and objects and learn 
denotations. This section is concerned with the structure of activities. 

Time series data are sampled at some frequency (10 Hz for our robots) but 
the robot changes what it is doing more slowly. Of course, one can describe what 
a robot is doing at any time scale (up to 10 Hz), but some descriptions are better 
than others in ways we will describe shortly. First, an example. Figure 3 shows
three variables from a multivariate time series, each running for 1500 ticks, or 
150 seconds. These series together tell a story: the robot approaches an object 
and starts to orbit it; at two points (around tick 500 and again around tick 800) 
the object disappears from view, and the robot’s movement pattern changes in 
response. At this relatively macroscopic scale, then, the robot changes its activity 
half a dozen times, although, as one can see, its sensor values change with much 
higher frequency. 



Fig. 3. Time series of translational velocity, rotational velocity, and area of a region in 
the visual field. 

By episodic structure we mean relatively macroscopic patterns in time series 
that are meaningful in the sense of informing action. If Oates showed how to 
individuate things we might denote with nouns, prepositions, and adjectives, 
episodic structure individuates things we might denote with verbs. 

As noted earlier, the right way to proceed might be to couple the problem of 
learning episodic structure with the problem of learning how to act, but we have 
approached the problems as if they were independent. This means we need a way 
to assess whether a segment of a time series is apt to be meaningful — capable of 
informing action — which is independent of the actions an agent might perform. 




Said differently, we are looking for a particular kind of marker in time series,
one that says, “If you divide the series at these markers, then there is a good 
chance that the segments between the markers will be meaningful in the sense 
of informing action.” 

We have identified four kinds of markers of episode structures.

Coincidences. Random coincidences are rare, so coincidences often mark the
boundaries of episode structures. Figure 4 shows the same time series as in
Figure 3, though the series have been smoothed and shifted vertically away from
each other on the vertical axis. Vertical lines have been added at some of the 
points where two or more of the series change slope sharply. Now, if the series 
were unrelated, then such an inflection in one would be very unlikely to coincide 
with an inflection in another, for inflections are rare. If rare events in two or 
more series coincide, then the series are probably not unrelated. Moreover, the 
points of coincidence are good markers of episode structures, as they are points 
at which something causes changes in the series to coincide. 
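
A rough Python sketch of the coincidence heuristic follows. It marks ticks at which two or more series change slope sharply at about the same time. The function names, the second-difference test for a sharp change in slope, and the `threshold` and `window` parameters are assumptions made for this illustration, and the sketch assumes the series have already been smoothed.

```python
def slope_changes(series, threshold):
    """Ticks t at which the slope changes by more than threshold."""
    return [t for t in range(1, len(series) - 1)
            if abs((series[t + 1] - series[t]) - (series[t] - series[t - 1])) > threshold]

def coincidences(all_series, threshold=1.0, window=1):
    """Ticks at which at least two series change slope within `window` ticks."""
    changes = [set(slope_changes(s, threshold)) for s in all_series]
    marks = []
    for t in sorted(set().union(*changes)):
        # Count the series that have a slope change close to tick t.
        near = sum(any(abs(t - u) <= window for u in c) for c in changes)
        if near >= 2:
            marks.append(t)
    return marks
```

The window plays the same role as the tolerance discussed later for temporal relationships: events that coincide in the world need not show up at exactly the same tick in the sensor data.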




Fig. 4. The series from Figure 3 smoothed and shifted apart on the vertical axis.
Vertical lines show points at which two or more of the series experience a significant
change in slope.



Boundary entropy. Every unique subsequence in a series is characterized by
the distribution of subsequences that follow it; for example, the subsequence
“en” in this sentence repeats five times and is followed by tokens c, ”, t and s.
This distribution has an entropy value. In general, every subsequence of length
n has a boundary entropy, which is the entropy of the distribution of subsequences
that follow it. If a subsequence S is an episode, then the boundary entropies
of subsequences of S will have an interesting profile: They will start relatively
high, then sometimes drop, then peak at the last element of S. The reasons for
this are, first, that the predictability of elements within an episode increases as
the episode extends over time; and, second, that the element that immediately
follows an episode is relatively uncertain. Said differently, within episodes, we
know roughly what will happen, but at episode boundaries we become uncertain.
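
Boundary entropy is simple to compute directly. The sketch below is a minimal illustration, with a hypothetical function name and with entropy measured in bits (the base of the logarithm is an assumption). It counts the tokens that follow each occurrence of a subsequence and takes the entropy of that distribution.

```python
import math
from collections import Counter

def boundary_entropy(sequence, sub):
    """Entropy, in bits, of the distribution of tokens that follow `sub`."""
    k = len(sub)
    followers = Counter(sequence[i + k] for i in range(len(sequence) - k)
                        if sequence[i:i + k] == sub)
    total = sum(followers.values())
    return -sum(c / total * math.log2(c / total) for c in followers.values())
```

In the sequence `"abcabd"`, every `a` is followed by `b`, so the boundary entropy of `a` is zero, while `ab` is followed once by `c` and once by `d`, giving one bit.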

Frequency. Episode structures are meaningful if they inform action. Rare
structures might be informative in the information theoretic sense, but have few
opportunities to inform action because they arise infrequently. Consequently, all
human and animal learning places a premium on frequency (and, by the way,
learning curves have their characteristic shape). In general, episode structures are
common structures. However, not all common structures are episode structures.
Very often, the most frequent structures in a domain are the smallest or
shortest, while the meaningful structures (those that inform action) are longer. A
useful example of this phenomenon comes from word morphology. The following
subsequences are the 100 most frequently occurring patterns in the first 10,000
characters of Orwell’s book 1984, but many are not morphemes, that is,
meaningful units:

th in the re an en as ed to on it er of at ing was or st on ar and es ic el al om 
ad ac is wh le ow Id ly ere he wi ab im ver be for had ent itwas with ir win gh 
po se id ch ot ton ap str his ro li all et fr andthe ould min il ay un ut nr ve 
whic dow which si pi am ul res that were ethe wins not winston sh oo up ack 
ter ough from ce ag pos bl by tel ain 

Even so, frequency provides a good marker for the boundaries of episode
structures. Suppose that the subsequences wx and yz are both very common
and subsequence xy is rare; where would you place a boundary in the sequence
wxyz?

Changes in Probability Distributions. Sequences can be viewed as the out- 
puts of finite state machines, and many researchers are interested in inducing 
the machines that generate sequences. Our interest is slightly different; we want 
to know the boundaries of sequences. Another way to say this is we want to 
place boundaries in such a way that the probability distributions to the left and 
right of the boundaries are different. 

5.1 Algorithms 

This section presents two algorithms for learning episode structures. The first is 
based on the coincidence heuristic, above; the second relies on boundary entropy 
and frequency. These algorithms are described in detail in separate papers, and
some of the material in the following sections is excerpted from these papers.
We are working on an online version of BCD that implements the “change in 
probability distribution” heuristic but the work is preliminary. 

Fluents and Temporal Relationships. The fluent learning algorithm induces
episode structures from time series of binary vectors. A binary vector b_t
is a simple representation of a set of logical propositions at time t: b[i]_t = 1
means proposition p_i is true. If a proposition is true for all the discrete times
in the range m,n (i.e., b[i]_{m,n} = 1) then the proposition is called a base fluent.
(States with persistence are called fluents by McCarthy.) In an experiment
with a Pioneer I mobile robot, we collected a dataset of 22535 binary vectors of
length 9. Sensor readings such as translational and rotational velocity, the
output of a “blob vision” system, sonar values, and the states of gripper and bump
sensors, were inputs to a simple perceptual system that produced the following
nine propositions: STOP, ROTATE-RIGHT, ROTATE-LEFT, MOVE-FORWARD,
NEAR-OBJECT, PUSH, TOUCH, MOVE-BACKWARD, STALL.

Allen [2] gave a logic for relationships between the beginnings and ends of
fluents. We use a nearly identical set of relationships: 

SBEB X starts before Y, ends before Y; Allen’s “overlap” 

SWEB Y starts with X, ends before X; Allen’s “starts” 

SAEW Y starts after X, ends with X; Allen’s “finishes” 

SAEB Y starts after X, ends before X; Allen’s “during” 

SWEW Y starts with X, ends with X; Allen’s “equal” 

SE Y starts after X ends; amalgamating Allen’s “meets” and “before” 

In Allen’s calculus, “meets” means the end of X coincides exactly with the be- 
ginning of Y, while “before” means the former event precedes the latter by some 
interval. In our work, the truth of a predicate such as SE or SBEB depends on 
whether start and end events happen within a window of brief duration. Said
differently, “starts with” means “starts within a few ticks of,” and “starts before”
means “starts more than a few ticks before.” The reason for this window is that
on a robot, it takes time for events to show up in sensor data and be processed 
perceptually into propositions, so coinciding events will not necessarily produce 
propositional representations at exactly the same time. 
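
The six relationships and the tolerance window can be written down as a small classifier. The following Python sketch is an illustration only: the function name `relation`, the representation of a fluent as a `(start, end)` pair of ticks, the default window of two ticks, and the decision to apply the window to SE as well are assumptions, not details from the fluent learning papers.

```python
def relation(x, y, window=2):
    """Classify fluents x and y, each a (start, end) pair of ticks, into one of
    the six relationships; returns None if none applies (for example when the
    roles of x and y should be swapped)."""
    (xs, xe), (ys, ye) = x, y
    if ys >= xe - window:                # Y starts about when X ends, or later
        return "SE"
    starts_with = abs(ys - xs) <= window
    ends_with = abs(ye - xe) <= window
    starts_after = ys - xs > window      # Y starts after X (X starts before Y)
    ends_before = xe - ye > window       # Y ends before X
    if starts_with and ends_with:
        return "SWEW"
    if starts_with and ends_before:
        return "SWEB"
    if starts_after and ends_with:
        return "SAEW"
    if starts_after and ends_before:
        return "SAEB"
    if starts_after and ye - xe > window:  # X starts before Y and ends before Y
        return "SBEB"
    return None
```

For example, with X spanning ticks 0 to 20 and Y spanning ticks 1 to 19, both endpoints fall within the window, so the pair is classified as SWEW even though the ticks are not identical.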

Let ρ ∈ {SBEB, SWEB, SAEW, SAEB, SWEW, SE}, and let f be a proposition (e.g.,
MOVE-FORWARD). Composite fluents have the form:

$$F \leftarrow f \mid \rho(f, f)$$

$$CF \leftarrow \rho(F, F)$$

That is, a fluent F may be a proposition or a temporal relationship between
propositions, and a composite fluent is a temporal relationship between fluents.
A situation has many alternative fluent representations; we want a method for
choosing some over others. The method will be statistical: We will only accept
ρ(F, F) as a representation if the constituent fluents are statistically associated,
if they “go together.”

Consider a composite fluent like SBEB(brake, clutch): When I approach a stop
light in my standard transmission car, I start to brake, then depress the clutch
to stop the car stalling; later I release the brake to start accelerating, and then I
release the clutch. To see whether this fluent, SBEB(brake, clutch), is
statistically significant, we need two contingency tables, one for the relationship “start
braking then start to depress the clutch” and one for “end braking and then end
depressing the clutch”:

               s(x=clutch)   s(x!=clutch)                  e(x=clutch)   e(x!=clutch)
s(x=brake)         a1            b1          e(x=brake)        a2            b2
s(x!=brake)        c1            d1          e(x!=brake)       c2            d2



Imagine some representative numbers in these tables: Only rarely do I start
something other than braking and then depress the clutch, so c1 is small. Only
rarely do I start braking and then start something other than depressing the
clutch (otherwise the car would stall), so b1 is also small. Clearly, a1 is relatively
large, and d1 bigger still, so the first table has most of its frequencies on a
diagonal, and will produce a significant statistic. Similar arguments hold for
the second table. When both tables are significant, we say SBEB(brake, clutch) is
a significant composite fluent.
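
The significance test on the two tables can be sketched with the standard Pearson chi-square statistic for a 2x2 table. The function names and the example counts used below are hypothetical, and the critical value 3.841 (one degree of freedom, p = 0.05) is an assumed threshold; the text says only that the threshold is adjustable.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def significant(table, critical=3.841):
    """critical=3.841 is the chi-square cutoff for 1 d.f. at p = 0.05 (assumed)."""
    return chi_square_2x2(*table) > critical

def composite_significant(start_table, end_table):
    """A composite fluent is accepted only when both tables are significant."""
    return significant(start_table) and significant(end_table)
```

With invented counts such as (a1, b1, c1, d1) = (40, 5, 5, 200), most of the frequency lies on the diagonal and the statistic is far above the cutoff, whereas a table with equal counts in every cell yields a statistic of zero.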

Fluent learning algorithm. The fluent learning algorithm incrementally processes 
a time series of binary vectors. At each tick, a bit in the vector bt is in one of 
four states: 



Still off: b_{t-1} = 0 ∧ b_t = 0
Still on:  b_{t-1} = 1 ∧ b_t = 1
Just off:  b_{t-1} = 1 ∧ b_t = 0
Just on:   b_{t-1} = 0 ∧ b_t = 1



The fourth case is called opening; the third case closing. It is easy to test when
base fluents (those corresponding to propositions) open and close, slightly more
complicated for composite fluents such as SBEB(f1, f2), because of the ambiguity
about which fluent opened. Suppose we see open(f1) and then open(f2). It’s
unclear whether we have just observed open(SBEB(f1, f2)), open(SAEB(f1, f2)),
or open(SAEW(f1, f2)). Only when we see whether f2 closes after, before, or with
f1 will we know which of the three composite fluents opened with the opening
of f2.
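
For base fluents, detecting open and close events amounts to comparing each bit with its value at the previous tick. The Python sketch below illustrates this for a series of binary vectors. The function names and the event tuple layout are hypothetical, and the sketch covers base fluents only, not the disambiguation of composite fluents.

```python
def bit_state(prev, curr):
    """One of the four states of a bit, given its previous and current value."""
    return {(0, 0): "still off", (1, 1): "still on",
            (1, 0): "just off", (0, 1): "just on"}[(prev, curr)]

def open_close_events(vectors):
    """List of (tick, proposition index, 'open' or 'close') for base fluents."""
    events = []
    for t in range(1, len(vectors)):
        for i, (p, c) in enumerate(zip(vectors[t - 1], vectors[t])):
            state = bit_state(p, c)
            if state == "just on":
                events.append((t, i, "open"))
            elif state == "just off":
                events.append((t, i, "close"))
    return events
```

For the vectors `[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]`, proposition 0 opens at tick 1 and closes at tick 3, and proposition 1 opens at tick 2 and closes at tick 4. Proposition 0 starts before and ends before proposition 1, which is the SBEB pattern.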

The fluent learning algorithm maintains contingency tables that count
co-occurrences of open and close events. We restrict the number of ticks, m, by
which one opening must happen after another: m must be bigger than a few
ticks, otherwise we treat the openings as simultaneous; and it must be smaller
than the length of a short-term memory (see the footnote below). At each tick, the
algorithm first decides which simple and composite fluents have closed. With this
information, it can disambiguate which composite fluents opened at an earlier
time (within the bounds of short term memory). Then, it finds out which simple
and composite fluents have just opened, or might have opened. This done, it
updates the open and close contingency tables for all fluents that have just closed.
Next, it updates the χ² statistic for each table and it adds the newly significant
composite fluents to the list of accepted fluents.

Footnote: The short term memory has two kinds of justification. First, animals
do not learn associations between events that occur far apart in time. Second, if
every open event could be paired with every other (and every close event) over a
long duration, then the fluent learning system would have to maintain an
enormous number of contingency tables.

Two fluents learned by the algorithm are shown in Figure 5 (others are
discussed elsewhere). These fluents were never used by the robot for anything (besides
learning other fluents) so they are not meaningful representations in the sense
discussed earlier, but they illustrate the kind of structural abstractions produced
by the fluent learning process. The first captures a strong regularity in how
the robot approaches an obstacle. Once the robot detects an object visually,
it moves toward it quite quickly, until the sonars detect the object. At that
point, the robot immediately stops, and then moves forward more slowly. Thus,
we expect to see SAEB(NEAR-OBJECT, STOP), and we expect this fluent to start
before MOVE-FORWARD, as shown in the first fluent. The second fluent shows
that the robot stops when it touches an object but remains touching the object
after the STOP fluent closes (SWEB(TOUCH, STOP)) and this composite fluent
starts before and ends before another composite fluent in which the robot is
simultaneously moving forward and pushing the object. This fluent describes
exactly how the robot pushes an object.



[Figure 5 diagram. Fluent 1: near-obstacle, move-forward, stop. Fluent 2: touch, push, move-forward, stop.]



Fig. 5. Two composite fluents. These were learned without supervision from a time 
series of 22535 binary vectors of robot data. 



Fluent learning works for multivariate time series in which all the variables 
are binary. It does not attend to the durations of fluents, only the temporal 
relationships between open and close events. This is an advantage in domains 
where the same episode can take different amounts of time, and a disadvantage 
in domains where duration matters. Because it is a statistical technique, fluent 
learning finds common patterns, not all patterns; it is easily biased to find more
or fewer patterns by adjusting the threshold value of the statistic and varying 
the size of the fluent short term memory. Fluent learning elucidates the hierar- 
chical structure of episodes (i.e., episodes contain episodes) because fluents are 
themselves nested. We are not aware of any other algorithm that is unsupervised, 
incremental, multivariate, and elucidates the hierarchical structure of episodes. 

The Voting Experts Algorithm. The voting experts algorithm is designed to
find boundaries of substructures within episodes, places where one macroscopic 
part of an episode gives way to another. It incorporates “experts” that attend to 
boundary entropy and frequency, and is easily extensible to include experts that 
attend to other characteristics of episode structures. Currently the algorithm 
works with univariate sequences of categorical data. The algorithm simply moves 
a window across a time series and asks, for each location in the window, whether 
to “cut” the series at that location. Each expert casts a vote. Each location takes 
n steps to traverse a window of size n, and is seen by the experts in n different 
contexts, and may accrue up to n votes from each expert. Given the results of 
voting, it is a simple matter to cut the series at locations with high vote counts. 
The algorithm has been tested extensively with sequences of letters in text: 
Spaces, punctuation and capitalization are removed, and the algorithm is able 
to recover word boundaries. It also performs adequately, though not brilliantly, 
on sequences of robot states. Research in that domain continues. 



Here are the steps of the algorithm: 

Build a prefix tree of depth n + 1. Nodes at level i of a prefix tree represent 
ngrams of length i. The children of a node are the extensions of the ngram 
represented by the node. For example, a b c a b d produces the following prefix 
tree of depth 2: 




a (2)    b (2)    c (1)    d (1) 

ab (2)   bc (1)   bd (1)   ca (1) 

Every ngram of length 2 or less in the sequence a b c a b d is represented by 
a node in this tree. The numbers in parentheses represent the frequencies of 
the subsequences. For example, the subsequence a b occurs twice, and every 
occurrence of a is followed by b. 
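
The counting behind this example can be sketched as follows. This is a minimal illustration that assumes a flat dictionary keyed by ngram stands in for the tree (each key corresponds to one node); the name `ngram_counts` is illustrative and not taken from the paper.

```python
from collections import Counter

def ngram_counts(tokens, max_len):
    """Count every ngram of length 1..max_len; each key plays the role of a tree node."""
    counts = Counter()
    for i in range(len(tokens)):
        for length in range(1, max_len + 1):
            if i + length <= len(tokens):
                counts[tuple(tokens[i:i + length])] += 1
    return counts

# The sequence a b c a b d, with ngrams up to length 2.
counts = ngram_counts("abcabd", 2)
for gram, c in sorted(counts.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(gram), c)
```

A real implementation would more likely store children under their parent node, which makes the extensions of an ngram cheap to enumerate.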

For the first 10,000 characters in George Orwell’s book 1984, a prefix tree 
of depth 7 includes 33774 nodes, of which 9109 are leaf nodes. That is, there 
are over nine thousand unique subsequences of length 7 in this sample of text, 
although the average frequency of these subsequences is 1.1; most occur exactly 
once. The average frequencies of subsequences of length 1 to 7 are 384.4, 23.1, 
3.9, 1.8, 1.3, 1.2, and 1.1. 

Calculate boundary entropy. The boundary entropy of an ngram is the entropy 
of the distribution of tokens that can extend the ngram. The entropy of 
a distribution of a random variable x is just $-\sum_x \Pr(x) \log \Pr(x)$. Boundary 
entropy is easily calculated from the prefix tree. For example, the node a has 
entropy equal to zero because it has only one child whereas the entropy of node 
b is 1.0 because it has two equiprobable children. 
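
A small sketch of this calculation, assuming base-2 logarithms (which is what makes two equiprobable children give 1.0) and assuming the ngram counts for a b c a b d are available as a plain dictionary. The function name `boundary_entropy` is illustrative.

```python
import math

# Counts for the sequence a b c a b d, ngrams of length 1 and 2.
counts = {
    ("a",): 2, ("b",): 2, ("c",): 1, ("d",): 1,
    ("a", "b"): 2, ("b", "c"): 1, ("b", "d"): 1, ("c", "a"): 1,
}

def boundary_entropy(counts, ngram):
    """Base-2 entropy of the distribution of tokens that extend `ngram`."""
    k = len(ngram)
    children = [c for g, c in counts.items() if len(g) == k + 1 and g[:k] == ngram]
    total = sum(children)
    if total == 0:
        return 0.0  # no observed extension (assumed convention for leaf nodes)
    return sum((c / total) * math.log2(total / c) for c in children)

entropy_a = boundary_entropy(counts, ("a",))
entropy_b = boundary_entropy(counts, ("b",))
```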

Standardize frequencies and boundary entropies. In most domains, there 
is a systematic relationship between the length and frequency of patterns; in 
general, short patterns are more common than long ones (e.g., on average, for 
subsets of 10,000 characters from Orwell’s text, 64 of the 100 most frequent 
patterns are of length 2; 23 are of length 3, and so on). Our algorithm will 
compare the frequencies and boundary entropies of ngrams of different lengths, 
but in all cases we will be comparing how unusual these frequencies and entropies 
are, relative to other ngrams of the same length. To illustrate, consider the words 
“a” and “an”. In the first 10000 characters of Orwell’s text, “a” occurs 743 times, 
“an” 124 times, but “a” occurs only a little more frequently than other one-letter 
ngrams, whereas “an” occurs much more often than other two-letter ngrams. In 
this sense, “a” is ordinary, “an” is unusual. Although “a” is much more common 
than “an” it is much less unusual relative to other ngrams of the same length. 
To capture this notion, we standardize the frequencies and boundary entropies 
of the ngrams. Standardized, the frequency of “a” is 1.1, whereas the frequency 
of “an” is 20.4. We standardize boundary entropies in the same way, and for the 
same reason. 
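
One way to read "standardize" is as a z-score computed within each ngram length. The sketch below makes that assumption explicit (subtract the mean and divide by the population standard deviation of same-length ngrams); the paper does not say which variant of the standard deviation it uses, and the toy values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_by_length(values):
    """Map each ngram to its z-score relative to other ngrams of the same length."""
    by_len = defaultdict(list)
    for gram, v in values.items():
        by_len[len(gram)].append(v)
    z = {}
    for gram, v in values.items():
        group = by_len[len(gram)]
        mu, sigma = mean(group), pstdev(group)
        z[gram] = (v - mu) / sigma if sigma > 0 else 0.0
    return z

# Hypothetical frequencies, not taken from the Orwell sample.
freqs = {"a": 10, "b": 2, "c": 0, "an": 5, "at": 1, "it": 0}
z = standardize_by_length(freqs)
```

After this step a long ngram can outrank a short one even though its raw frequency is lower, which is the comparison the experts need.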



Score potential segment boundaries. In a sequence of length k there are 
$k - 1$ places to draw boundaries between segments, and, thus, there are $2^{k-1}$ 
ways to divide the sequence into segments. Our algorithm is greedy in the sense 
that it considers just $k - 1$, not $2^{k-1}$, ways to divide the sequence. It considers 
each possible boundary in order, starting at the beginning of the sequence. 
The algorithm passes a window of length n over the sequence, halting at each 
possible boundary. All of the locations within the window are considered, and 
each garners zero or one vote from each expert. Because we have two experts, 
for boundary-entropy and frequency, respectively, each possible boundary may 
garner up to 2n votes. This is illustrated in Figure 6. A window of length 3 is 
passed along the sequence itwasacold. 



Initially, the window covers itw. The entropy and frequency experts each de- 
cide where they could best insert a boundary within the window. The boundary 
entropy expert votes for the location that produces the ngram with the highest 
standardized boundary entropy, and the frequency expert places a boundary so 
as to maximize the sum of the standardized frequencies of the ngrams to the 
left and the right of the boundary. In this example, the entropy expert favors 
the boundary between t and w, while the frequency expert favors the boundary 
between w and whatever comes next. Then the window moves one location to 
the right and the process repeats. This time, both experts decide to place the 
boundary between t and w. The window moves again and both experts decide 
to place the boundary after s, the last token in the window. Note that each 
potential boundary location (e.g., between t and w) is seen n times for a window 
of size n, but it is considered in a slightly different context each time the window 
moves. The first time the experts consider the boundary between w and a, they 
are looking at the window itw, and the last time, they are looking at was. 

Fig. 6. The operation of the voting experts algorithm. 



In this way, each boundary gets up to 2n votes, or n = 3 votes from each of 
two experts. The wa boundary gets one vote, the tw boundary, three votes, and 
the sa boundary, two votes. 

Segment the sequence. Each potential boundary in a sequence accrues votes, 
as described above, and now we must evaluate the boundaries in terms of the 
votes and decide where to segment the sequence. Our method is a familiar “zero 
crossing” rule: If a potential boundary has a locally maximum number of votes, 
split the sequence at that boundary. In the example above, this rule causes the 
sequence itwasacold to be split after it and was. We confess to one embellish- 
ment on the rule: The number of votes for a boundary must exceed a threshold, 
as well as be a local maximum. We found that the algorithm splits too often 
without this qualification. In the experiments reported below, the threshold was 
always set to n, the window size. This means that a location must garner half 
the available votes (for two voting experts) and be a local maximum to qualify 
for splitting the sequence. 
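
The splitting rule can be sketched as below. Several details are assumptions made for the sketch: a "local maximum" is taken to mean strictly more votes than both neighbouring boundaries, boundaries with no recorded votes count as zero, and the threshold test is a strict "exceeds". The vote counts are the three quoted for the itwasacold example; the function name `segment` is illustrative.

```python
def segment(tokens, votes, threshold=0):
    """Split `tokens` at boundaries whose vote count is a local maximum above threshold.

    votes[k] is the number of votes for cutting between tokens[k-1] and tokens[k].
    """
    cuts = []
    for k in range(1, len(tokens)):
        v = votes.get(k, 0)
        left = votes.get(k - 1, 0)
        right = votes.get(k + 1, 0)
        if v > threshold and v > left and v > right:
            cuts.append(k)
    pieces, start = [], 0
    for k in cuts:
        pieces.append(tokens[start:k])
        start = k
    pieces.append(tokens[start:])
    return pieces

# Votes from the worked example: t|w gets 3, w|a gets 1, s|a gets 2.
votes = {2: 3, 3: 1, 5: 2}
plain_rule = segment("itwasacold", votes)                 # local maxima only
with_threshold = segment("itwasacold", votes, threshold=2)
```

Raising the threshold removes the weaker of the two cuts, which is the effect the embellishment is meant to have on over-splitting.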

The algorithm performs well at a challenging task, illustrated below. In this 
block of text, the first 200 characters in Orwell’s 1984, all spaces and 
punctuation have been excised, and all letters made capital; and to foil your 
ability to recognize words, the letters have been recoded in a simple way (each 
letter is replaced by its neighbor to the left in the alphabet, and A by Z): 



HSVZRZAQHFGSBNKCCZXHMZOQHKZMCSGDBKNBJRVDQDRSQ 
HJHMFSGHQSDDMVHMRSNMRLHSGGHRBGHMMTYYKDCHMSNG 
HRAQDZRSHMZMDEENQSSNDRBZODSGDUHKDVHMCRKHOODC 
PTHBJKXSGQNTFGSGDFKZRRCNNQRNEUHBSNQXLZMRHNMRS 
GNTFGMNSPTHBJKXDMNTFGS 
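
The recoded block is consistent with every letter having been moved one place back in the alphabet, so moving each letter one place forward recovers the text. A short sketch of that shift (the function name `shift_letters` is illustrative):

```python
def shift_letters(text, k):
    """Shift each capital letter k places through the alphabet, wrapping around."""
    base = ord("A")
    return "".join(chr((ord(c) - base + k) % 26 + base) for c in text)

plain = "ITWASABRIGHTCOLDDAY"          # opening words of 1984, spaces removed
encoded = shift_letters(plain, -1)     # one place back: I -> H, A -> Z
decoded = shift_letters(encoded, 1)    # one place forward undoes it
```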



Suppose you had a block of text several thousand characters long to study at 
leisure. Could you place boundaries where they should go, that is, in locations 
that correspond to words in the original text? (You will agree that the original 
text is no more meaningful to the voting experts algorithm than the text above 
is to you, so the problem we pose to you is no different than the one solved by 
the algorithm.) 

To evaluate the algorithm we designed several performance measures. The hit 
rate is the proportion of boundaries in the text that were indicated by the 
algorithm, and the false positive rate is the proportion of boundaries indicated 
by the algorithm that were not boundaries in the text. The exact word rate is the proportion of 
words for which the algorithm found both boundaries; the dangling and lost 
rates are the proportions of words for which the algorithm identifies only one, or 
neither, boundary, respectively. We ran the algorithm on corpora of Roma-ji text 
and a segment of Franz Kafka’s The Castle in the original German. Roma-ji is a 
transliteration of Japanese into roman characters. The corpus was a set of Anime 
lyrics, comprising 19163 roman characters. For comparison purposes we selected 
the first 19163 characters of Kafka’s text and the same number of characters 
from Orwell’s text. We stripped away spaces, punctuation and capitalization and 
the algorithm induced word boundaries. Here are the results: 



           Hit rate   F. P. rate   Exact   Dangling   Lost 
English      .71         .28        .49      .44       .07 
German       .79         .31        .61      .35       .04 
Roma-ji      .64         .34        .37      .53       .10 



Clearly, the algorithm is not biased to do well on English, in particular, 
as it actually performs best on Kafka’s text, losing only 4% of the words and 
identifying 61% exactly. The algorithm performs less well with the Roma-ji text; 
it identifies fewer boundaries accurately (i.e., places 34% of its boundaries within 
words) and identifies fewer words exactly. The explanation for these results has 
to do with the lengths of words in the corpora. We know that the algorithm 
loses disproportionately many short words. Words of length 2 make up 32% of 
the Roma-ji corpus, 17% of the Orwell corpus, and 10% of the Kafka corpus, 
so it is not surprising that the algorithm performs worst on the Roma-ji corpus 
and best on the Kafka corpus. 
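
The five measures reported in the table can be computed along the following lines. This is a sketch under stated assumptions: the start and end of the whole sequence are treated as boundaries that are always found, and the false positive rate is taken as a fraction of the boundaries the algorithm proposed. The function name `evaluate` and the toy input are illustrative.

```python
def evaluate(true_words, predicted_cuts):
    """Score predicted cut positions against the true word boundaries."""
    spans, pos = [], 0
    for w in true_words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    total = pos
    true_cuts = {end for _, end in spans if end < total}   # internal boundaries
    predicted = set(predicted_cuts)
    found = predicted | {0, total}                         # sequence ends assumed found
    n_found = [(s in found) + (e in found) for s, e in spans]
    return {
        "hit": len(true_cuts & predicted) / len(true_cuts) if true_cuts else 0.0,
        "false_positive": len(predicted - true_cuts) / len(predicted) if predicted else 0.0,
        "exact": n_found.count(2) / len(spans),
        "dangling": n_found.count(1) / len(spans),
        "lost": n_found.count(0) / len(spans),
    }

# "it was a cold" with cuts after "it" and after "was".
scores = evaluate(["it", "was", "a", "cold"], [2, 5])
```

By construction the exact, dangling and lost rates sum to one, as the rows of the table do.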

As noted earlier, the algorithm performs less well with time series of robot 
states. The problem seems to be that episode substructures are quite long (over 
six seconds or 60 discrete ticks of data, on average, compared with Orwell’s 
average word length, around 5). The voting experts algorithm can find episode 
structures that are longer than the depth of its prefix tree, but recall that the 
frequency of ngrams drops with their length, so most long ngrams occur only 
once. This means the frequency and boundary entropy experts have no distribu- 
tions to work with, and even if they did, they would have difficulty estimating 
the distributions with any accuracy from such small numbers. 

Still, it is remarkable that two very general heuristic methods can segment 
text into words with such accuracy. Our results lead us to speculate that fre- 
quency and boundary entropy are general markers of episode substructures, a 
claim we are in the process of testing in other domains. Recall that Oates used 
these heuristics to bootstrap the process of finding words in the speech signal. 



6 Conclusion 

The central claim of this paper is that programs can learn representations and 
their meanings. We adopted Dretske’s definition that a representation is mean- 
ingful if it reliably indicates something about the external world and the indicator 
relationship is exploited to inform action. These criteria place few constraints on 
what is represented, how it is represented, and how representations inform ac- 
tion, yet these questions lie at the heart of AI engineering design, and answering 
them well requires considerable engineering skill. Moreover, these criteria admit 
representations that are meaningful in the given sense to an engineer but not 
to a program. This is one reason Dretske [12] required that the function of the 
indicator relationship be learned, to ensure that meaning is endogenous in the 
learning agent. Dretske’s requirement leads to some philosophical problems [10] 
and we do not think it can survive as a criterion for contentful mental states [8]. 
However, we want programs to learn the meanings of representations not as a 
condition in a philosophical account of representation, meaning and belief, but as 
a practical move beyond current AI engineering practice, in which all meanings 
are exogenous; and as a demonstration of how symbolic representations might 
develop in infant humans. 

What is the status of our claim that programs can learn representations and 
their meanings? As our adopted notion of meaning does not constrain what 
is represented, how it is represented, and how representations inform action, 
we have considerable freedom in how we gather evidence relevant to the claim. 
In fact, we imposed additional constraints on learned representations in our 
empirical work: They should be grounded in sensor data from a robot; the data 
should have a temporal aspect and time or the ordinal sequence of things should 
be an explicit part of the learned representations; and the representations should 
not merely inform action, but should inform two essentially human intellectual 
accomplishments, language and planning. We have demonstrated that a robot 
can learn the meanings of words, and construct simple plans, and that both 
these abilities depend on representations and meanings learned by the robot. In 
general, we have specified how things are to be represented (e.g., as transition 
probability matrices, sequences of means and variances, multivariate time series, 
fluents, etc.) but the contents of the representations (i.e., what is represented) 
and the relationship between the contents and actions have been learned. 



Abstract. The growing use of information visualization tools and data 
mining algorithms stems from two separate lines of research. Informa- 
tion visualization researchers believe in the importance of giving users 
an overview and insight into the data distributions, while data mining 
researchers believe that statistical algorithms and machine learning can 
be relied on to find the interesting patterns. This paper discusses two 
issues that influence design of discovery tools: statistical algorithms vs. 
visual data presentation, and hypothesis testing vs. exploratory data 
analysis. I claim that a combined approach could lead to novel discovery 
tools that preserve user control, enable more effective exploration, and 
promote responsibility. 



Abstract. In this paper, we study the problem of using statistical query 
(SQ) to learn a class of highly correlated boolean functions, namely, a 
class of functions where any pair agree on significantly more than 1/2 
fraction of the inputs. We give an almost-tight bound on how well one 
can approximate all the functions without making any query, and then 
we show that beyond this bound, the number of statistical queries the 
algorithm has to make increases with the “extra” advantage the algo- 
rithm gains in learning the functions. Here the advantage is defined to 
be the probability the algorithm agrees with the target function minus 
the probability the algorithm doesn’t agree. 

An interesting consequence of our results is that the class of booleanized 
linear functions over a finite field ($f_a(x) = 1$ iff $\phi(a \cdot x) = 1$, where 
$\phi$ is an arbitrary boolean function that maps any element in $GF_p$ to 
$\pm 1$) is not efficiently learnable. This result is useful since the hardness of 
learning booleanized linear functions over a finite field is related to the 
security of certain cryptosystems [B01]. In particular, we prove that the 
class of linear threshold functions over a finite field ($f_{a,b}(x) = 1$ iff 
$a \cdot x \ge b$) cannot be learned efficiently using statistical queries. This 
contrasts with Blum et al.’s result [BFK+96] that linear threshold functions over 
the reals (perceptrons) are learnable using the SQ model. 

Finally, we describe a PAC-learning algorithm that learns a class of linear 
threshold functions in time that is provably impossible for statistical 
query algorithms. With properly chosen parameters, this class of linear 
threshold functions become an example of PAC-learnable, but not SQ- 
learnable functions that are not parity functions. 



1 Introduction 

Pioneered by Valiant [V84], machine learning theory is concerned with problems 
like “What class of functions can be efficiently learned under this learning 
model?”. Among different learning models there are the Probably Approximately 
Correct model (PAC) by Valiant [V84] and the Statistical Query model (SQ) by 
Kearns [K98]. 

The SQ model is a restriction of the PAC model, where the learning algorithm 
doesn’t see the samples with their labels, but only gets the probabilities that a 
predicate is true: to be more precise, the learning algorithm provides a predicate 
$g(x, y)$ and a tolerance $\epsilon$, and an SQ oracle returns a real number $v$ that is 
$\epsilon$-close to the expected value of $g(x, f(x))$ according to a distribution of $x$, 
where $f$ is the target function. While seemingly a lot weaker than the PAC model, 
the SQ model turns out to be very useful: in fact, a lot of known PAC learning 
algorithms are actually SQ model algorithms, or can be converted to SQ model 
algorithms. The readers are referred to [K98] for a more comprehensive description. 

One interesting feature of the SQ model is that there are information-theoretic 
lower bounds on the learnability of certain classes of functions. Kearns [K98] 
proved that parity functions cannot be efficiently learned in the SQ model. Blum 
et al. [BFJ+94] extended his result by showing that if a class of functions 
$\mathcal{F}$ has “SQ-dimension” $d$ (informally, the maximum number of “almost 
uncorrelated” functions in the class, where the “correlation” between two 
functions is the probability these two functions agree minus the probability they 
disagree), then an SQ learning algorithm has to make $\Omega(d^{1/3})$ queries, each 
of tolerance $O(d^{-1/3})$, in order to weakly learn $\mathcal{F}$. In [J00], Jackson 
further strengthened this lower bound by proving that $\Omega(2^n)$ queries are needed 
for an SQ-based algorithm to learn the class of parity functions over $n$ bits. 
This result can be extended to any class of completely uncorrelated functions: 
$\Omega(d)$ queries are needed for an SQ-based algorithm to learn a class of functions 
if this class contains $d$ functions that are completely uncorrelated. Notice that 
this lower bound is optimal: [BFJ+94] proved that there are weak-learning 
algorithms for the class of functions using $O(d)$ queries. 

In this paper, we study the problem of learning correlated functions. Suppose 
there is a class of boolean functions $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$, where any 
pair of functions $f_i$, $f_j$ are highly correlated, namely $f_i$ and $f_j$ agree on 
a $(1 + \lambda)/2$ fraction of the inputs, where $\lambda$ can be significantly larger 
than 0 (say, $\lambda = 1/3$). There are natural classes of correlated functions: for 
example, the “booleanized linear functions” in a finite field $GF_p$ defined in 
this paper. Informally, these functions are of the form $f_a(x) = \phi(a \cdot x)$, 
where $\phi$ (called a “booleanizer”) is an arbitrary function that maps any element 
in $GF_p$ to a boolean value ($+1$ or $-1$), and both $a$ and $x$ are vectors over 
$GF_p$. Booleanized linear functions can be viewed as natural extensions of parity 
functions (which are linear functions in $GF_2$), and intuitively, should be hard 
to learn by statistical query (since parity functions cannot be efficiently 
learned by statistical query). Actually they are (implicitly) conjectured to be 
hard to learn in general, and there are cryptosystems whose security is based on 
the assumption that booleanized linear functions are hard to learn. One example 
is the “blind encryption scheme” proposed by Baird [B01]: Roughly speaking, this 
private-key crypto-scheme picks a random $f_a$ as the secret key, and encrypts a 
‘0’ bit by a random $x$ such that $f_a(x) = +1$, and a ‘1’ bit by a random $x$ such 
that $f_a(x) = -1$. Knowing the secret key, decryption is just an invocation of 
$f_a$, which can be done very efficiently. Furthermore, it is (implicitly in [B01]) 
conjectured that, by only inspecting random plaintext-ciphertext pairs 
$(x, f_a(x))$, it is hard to learn the function $f_a$.^1 However, the results from 
[K98], [BFJ+94], [J00] don’t immediately apply here since these booleanized linear 
functions are indeed correlated, and the correlation can be very large (for 
example, [BFJ+94] requires the correlation between any two functions to be 
$O(1/d^3)$, whereas for the booleanized linear functions, the correlation is of 
order [...] and can even be constant). 

^1 This is not exactly what the “blind encryption scheme” does, but is similar. 

Notice that in the case of correlated functions, the notion of “weak learning” 
can become trivial: if any pair of functions have correlation $\lambda$, i.e., they 
agree on a $(1 + \lambda)/2$ fraction of the inputs, then by always outputting 
$f_1(x)$ on every input $x$, an algorithm can approximate any function $f_i$ with 
advantage at least $\lambda$ (the advantage of an algorithm is defined as the 
probability the algorithm predicts a function correctly minus the probability the 
algorithm predicts incorrectly). So if $\lambda$ is non-negligibly larger than 0, this 
algorithm “weakly learns” the function class without even making any query to the 
target function. 
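
The arithmetic behind this observation is short. Writing $\lambda$ for the pairwise correlation (the symbol is an assumption about the intended notation), the advantage of the constant strategy "always output $f_1$" against a target $f_i$ with $i \neq 1$ is:

```latex
\Pr_x[f_1(x) = f_i(x)] - \Pr_x[f_1(x) \neq f_i(x)]
  = \frac{1 + \lambda}{2} - \frac{1 - \lambda}{2}
  = \lambda .
```

Against the target $f_1$ itself the advantage is 1, so the strategy attains advantage at least $\lambda$ on every function in the class.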

In the first part of this paper, we prove that without making any query, an 
algorithm can have at most advantage [...] in approximating all the 
target functions $f_1, f_2, \ldots, f_d$, if any pair has almost the same correlation 
$\lambda$. We show this bound is almost tight by demonstrating an example where 
advantage [...] can be almost achieved for a specific class of functions. 

Also we prove that in order to have an “extra” advantage $\delta$, about $\sqrt{d} \cdot \delta / 2$ 
queries are needed. This shows an advantage-query complexity trade-off: the more 
advantage one wants, the more queries one has to make. One consequence of our 
result is that booleanized linear functions cannot be learned efficiently using 
statistical query, and if the booleanizer is “almost unbiased” and the finite field 
GFp is large, one cannot even weakly learn this class of functions. Our result 
provides some positive evidence towards the security of the blind encryption 
scheme by Baird [B01]. 

The technique we used in the proof, which could be of interest by itself, is 
to keep track of the “all-pair statistical distance” between scenarios when the 
algorithm is given different target functions; we denote this quantity by $\Delta$. 
We prove that: 

1. Before the algorithm makes any query, $\Delta = 0$. 

2. After the algorithm finishes all the queries, $\Delta$ is “large”. 

3. Each query only increases $\Delta$ by a “small” amount. 

And then we conclude that a lot of queries are needed in order to learn $\mathcal{F}$ well. 

One interesting consequence of our result is that the class of linear threshold 
functions is not efficiently learnable. A linear threshold function in a finite 
field is defined as $f_{a,b}(x) = 1$ if $a \cdot x \ge b$, and $-1$ otherwise, where 
$a \in GF_p^n$ and $b \in GF_p$. These linear threshold functions over $GF_p$ are 
interesting, since their counterparts over the reals are well known as 
“perceptrons” and their learnability is well studied. Blum et al. [BFK+96] proved 
that there are statistical query algorithms that learn linear threshold functions 
over the reals in polynomial time, even in the presence of noise, which contrasts 
sharply with our result. 

In the second part of this paper, we present a learning algorithm, BUILD-TREE, 
that learns a class of linear threshold functions over a finite field $GF_p$ 
where the threshold $b$ is fixed to be $(p + 1)/2$. Our algorithm uses a random 
example oracle, which produces a random pair $(x, f_a(x))$ upon each invocation. 
The algorithm’s running time is [...], while the brute-force search algorithm 
takes time [...], and any statistical query learning algorithm also has to 
take time [...] to even weakly learn the functions. If we “pad” the input 
properly, we can make BUILD-TREE’s running time polynomial in the input size, 
while still no SQ learning algorithm can learn the class efficiently. This gives an 
example of a PAC-learnable, but not SQ-learnable, class of functions. Previously, 
both [K98] and [BFJ+94] proved that the class of parity functions fits into this 
category, and later [BKW00] proved that a class of noisy parity functions also 
fits. Our example is the first class of functions in this category that are 
correlated and not parity functions. This result provides some insights towards 
better understanding of SQ-learning algorithms. 

The rest of the paper is organized as follows: section 2 gives some notations 
and definitions to be used in this paper; section 3 proves a lower bound for 
SQ-learning algorithms; section 4 discusses the algorithm BUILD-TREE and its 
analysis. 

Due to space constraints, most proofs of the lemmas and theorems in this 
paper are omitted. 



2 Notations and Definitions 

We give the notations and definitions to be used in the paper. 

2.1 Functions and Oracles 

Throughout this paper we are interested in functions whose input domain is a 
finite set $\Omega$, where $|\Omega| = M$, and whose output domain is $\{-1, +1\}$. An 
input $x$ to a function $f$ is called a positive example if $f(x) = +1$, and a 
negative example if $f(x) = -1$. Sometimes when the function $f$ is clear from the 
context, we call the value of $f(x)$ the label of $x$. In a lot of cases, $\Omega$ 
takes a special form: $\Omega = GF_p^n$, where $p$ is a prime number and $n$ is a 
positive integer. In this case, we write an input in the vector form $x$, and we 
use $x_i$ to denote its $i$-th entry, an element in $GF_p$. 

We now define the notion of learning a function. The overall model is an 
algorithm $A$ with oracle access to a function $f$ that $A$ tries to learn (we call 
$f$ the target function). $A$ is given an input $X$ and makes queries to the oracle. 
Finally $A$ outputs a bit as its prediction of $f(X)$. 

We use an “honest SQ-oracle” model, which is similar to the definition of 
“SQ-based algorithm” in [J00]: 

Definition 2.1. An honest SQ-oracle for function $f$ takes two parameters $g$ 
and $N$ as inputs, where $g : GF_p^n \times \{-1, +1\} \to \{-1, +1\}$ is a function that 
takes an input in $GF_p^n$ and a boolean input and outputs a binary value, and $N$ is 
a positive integer written in unary, called the sample count. The oracle returns 
$\frac{1}{N} \sum_{i=1}^{N} g(X_i, f(X_i))$, where each $X_i$ is a random variable independently chosen 
according to a pre-determined distribution $D$. We denote this oracle by $HSQ_f$. 
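
A minimal simulation of such an oracle, assuming its answer is the sample mean of $g(X_i, f(X_i))$ over $N$ independent draws (an assumption of this sketch). The target function, the uniform distribution over $GF_5^2$, and all names below are hypothetical choices made for illustration.

```python
import random

def make_honest_sq_oracle(f, sample_x):
    """Return an oracle that averages g(x, f(x)) over N freshly drawn inputs."""
    def oracle(g, N):
        total = 0
        for _ in range(N):
            x = sample_x()
            total += g(x, f(x))
        return total / N
    return oracle

p, n = 5, 2                      # hypothetical small field and dimension
a, b = (1, 2), (p + 1) // 2      # hypothetical secret vector and threshold
rng = random.Random(0)

def f(x):
    """Illustrative target: +1 if (a . x mod p) >= b, else -1."""
    return 1 if sum(ai * xi for ai, xi in zip(a, x)) % p >= b else -1

def sample_x():
    """Draw x uniformly from GF_p^n (the distribution D is assumed uniform here)."""
    return tuple(rng.randrange(p) for _ in range(n))

hsq = make_honest_sq_oracle(f, sample_x)
bias_estimate = hsq(lambda x, y: y, 1000)   # g(x, y) = y estimates the bias of f
```

Because the answer is an average of real samples rather than an adversarially perturbed expectation, the only error in `bias_estimate` is sampling noise.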

Notice that this definition of an honest SQ-oracle is different from the 
commonly used definition of a “normal” SQ-oracle (sometimes denoted as $STAT_f$) as 
in [AD98], [BFJ+94], [BFK+96], [BKW00], [K98]. Kearns [K98] proved that 
one can simulate a $STAT_f$ oracle efficiently in the PAC learning model, and 
Decatur [DM] extensively studied the problem of efficiently simulating a $STAT_f$ 
oracle. Both their results can be easily extended to show that an honest 
SQ-oracle can be used to efficiently simulate a “normal” SQ-oracle. Therefore a 
lower bound with respect to an honest SQ-oracle automatically translates to a 
lower bound with respect to a “normal” SQ-oracle up to a polynomial factor. 



2.2 Bias and Inner Products of Functions 

We define the bias of a real function $f$ over $\Omega$ to be the expected value of $f$ 
under a distribution $D$, and we denote that by $\langle f \rangle_D$: 

$$\langle f \rangle_D = E_D[f(x)] = \sum_{x \in \Omega} D(x) f(x)$$ 

We define the inner product of two real functions over $\Omega$ to be the expected 
value of $f \cdot g$, denoted by $\langle f, g \rangle_D$: 

$$\langle f, g \rangle_D = E_D[f(x) g(x)] = \sum_{x \in \Omega} D(x) f(x) g(x)$$ 

In the rest of the paper, we often omit the letter D if the distribution is clear 
from the context. 

We can also view the inner product as the “correlation” between $f$ and $g$. It 
is easy to verify that the definition of inner product is a proper one. Also it is 
important to observe that if $f$ is a boolean function, i.e., $\forall x, f(x) \in \{-1, +1\}$, 
then $\langle f, f \rangle = 1$. 
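
These two quantities are easy to compute directly on a small domain. The sketch below assumes the uniform distribution over $\{0,1\}^3$ and two hypothetical boolean functions chosen only for illustration.

```python
from itertools import product

domain = list(product((0, 1), repeat=3))
D = {x: 1 / len(domain) for x in domain}      # uniform distribution (assumed)

def bias(f):
    """Expected value of f under D."""
    return sum(D[x] * f(x) for x in domain)

def inner(f, g):
    """Expected value of f(x) * g(x) under D."""
    return sum(D[x] * f(x) * g(x) for x in domain)

def parity(x):
    return 1 if sum(x) % 2 == 0 else -1       # hypothetical example function

def first_bit(x):
    return 1 if x[0] == 0 else -1             # hypothetical example function

self_product = inner(parity, parity)          # a boolean function against itself
cross_product = inner(parity, first_bit)      # two uncorrelated functions
```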



2.3 Approximating and Learning Functions 

Given a function $f : \Omega \to \{-1, +1\}$ and an algorithm $A$ which takes elements in 
$\Omega$ and outputs $\pm 1$, we can measure how well $A$ approximates $f$. The algorithm 
could be a randomized one and thus the output of $A$ on any input is a random 
variable. We define the characteristic function of algorithm $A$ to be a real-valued 
function over the same domain $\Omega$, $\psi_A : \Omega \to [-1, +1]$, such that 

$$\psi_A(x) = 2 \cdot \Pr[A \text{ outputs } 1 \text{ on } x] - 1$$ 

where the probability is taken over the randomness $A$ uses and, if $A$ makes oracle 
queries, the randomness from the oracles. 

It is easy to verify that $\psi_A(x)$ is always within the range $[-1, 1]$. Given a 
probability distribution $D$ over $\Omega$, we define the advantage of algorithm $A$ in 
approximating function $f$ to be 

$$\langle f, \psi_A \rangle = \Pr_{A,D}[A \text{ agrees with } f \text{ on input } x] - \Pr_{A,D}[A \text{ disagrees with } f \text{ on input } x]$$ 

where the probability is taken over the randomness from $A$ and the $x$ that is 
randomly chosen from $\Omega$ according to $D$. 

It is not hard to see that if $A$ always agrees with $f$, then $\psi_A = f$, and the 
advantage of $A$ in approximating $f$ is 1; if $A$ randomly guesses a value for each 
input, then $\psi_A \equiv 0$, and the advantage of $A$ is 0. 
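
To make the two extreme cases concrete, consider a hypothetical algorithm that outputs the correct label with probability $q$ and the wrong label otherwise. Its characteristic function is $(2q - 1) f$, so its advantage is $2q - 1$. The sketch assumes the uniform distribution over $\{0,1\}^3$; all names are illustrative.

```python
from itertools import product

domain = list(product((0, 1), repeat=3))
D = {x: 1 / len(domain) for x in domain}       # uniform distribution (assumed)

def f(x):
    return 1 if sum(x) % 2 == 0 else -1        # hypothetical target function

def psi_noisy(q):
    """Characteristic function of an algorithm that is correct with probability q."""
    def psi(x):
        pr_one = q if f(x) == 1 else 1 - q     # Pr[A outputs 1 on x]
        return 2 * pr_one - 1
    return psi

def advantage(f, psi):
    """Inner product of f with the characteristic function psi under D."""
    return sum(D[x] * f(x) * psi(x) for x in domain)

always_right = advantage(f, psi_noisy(1.0))
coin_flip = advantage(f, psi_noisy(0.5))
three_quarters = advantage(f, psi_noisy(0.75))
```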

For a class of functions $\mathcal{F}$, and an oracle algorithm $A$, we say $A$ approximates 
$\mathcal{F}$ with advantage $\alpha$ if for every function $f \in \mathcal{F}$, the advantage of 
$A$ in approximating $f$ is at least $\alpha$. In the case $A$ queries an honest 
SQ-oracle $HSQ_f$ in order to approximate the target function $f$, we say $A$ learns 
$\mathcal{F}$ with advantage $\alpha$ with respect to an honest SQ-oracle. 

We note that the “advantage” measure for learning a function isn’t very different 
from the more commonly used “accuracy/confidence” measure in PAC 
learning. Recall that an algorithm learns $\mathcal{F}$ with accuracy $\epsilon$ and 
confidence $\delta$, if for any $f \in \mathcal{F}$, the algorithm $A$, using an oracle about 
$f$, with probability at least $1 - \delta$, agrees with $f$ with probability at least 
$1 - \epsilon$. It is easy to prove the following facts: 

Lemma 2.1. Let $\mathcal{F}$ be a class of boolean functions over $\Omega$, and let $A$ be an 
oracle algorithm. If $A$ learns $\mathcal{F}$ with accuracy $\epsilon$ and confidence $\delta$, 
then $A$ learns $\mathcal{F}$ with advantage at least $1 - 2\epsilon - 2\delta$. On the other 
hand, if $A$ learns $\mathcal{F}$ with advantage at least $\alpha$, then $A$ learns $\mathcal{F}$ 
with accuracy $\epsilon$ and confidence $\delta$ for any $(\epsilon, \delta)$ pair satisfying [...] 



The proof is a simple application of the Markov Inequality. 
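
A sketch of how the Markov argument can go, under assumptions introduced here: $E$ denotes the probability (over $x$) that $A$ disagrees with $f$ for a fixed outcome of the oracle's randomness, and $\alpha$, $\epsilon$, $\delta$ denote advantage, accuracy and confidence. The closing condition is what this sketch yields; it is not claimed to be the exact condition stated in the lemma.

```latex
% Direction 1: with probability at least 1 - \delta we have E \le \epsilon, hence
\Pr[\text{agree}] \;\ge\; (1 - \delta)(1 - \epsilon) \;\ge\; 1 - \epsilon - \delta
\quad\Longrightarrow\quad
2\Pr[\text{agree}] - 1 \;\ge\; 1 - 2\epsilon - 2\delta .

% Direction 2: advantage at least \alpha means \mathbb{E}[E] \le (1 - \alpha)/2,
% so by Markov's inequality
\Pr[E \ge \epsilon] \;\le\; \frac{\mathbb{E}[E]}{\epsilon}
  \;\le\; \frac{1 - \alpha}{2\epsilon} \;\le\; \delta
\quad\text{whenever}\quad
\epsilon \, \delta \;\ge\; \frac{1 - \alpha}{2} .
```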

Therefore, roughly speaking: if an algorithm $A$ learns $\mathcal{F}$ with high confidence 
and high accuracy ($A$ “strongly” learns $\mathcal{F}$), then the advantage of $A$ in learning 
$\mathcal{F}$ is close to 1; if $A$ learns $\mathcal{F}$ weakly, then the advantage of $A$ is 
non-negligibly higher than 0. On the other hand, if the advantage $A$ has in 
learning $\mathcal{F}$ is close to 1, then $A$ (strongly) learns $\mathcal{F}$. 

The reason that we use the advantage measure in this paper is that we want 
to show a continuous “trade-off” result between how many queries are needed 
and how “well” an algorithm learns $\mathcal{F}$, and using one single parameter makes 
the discussion more convenient. 

2.4 Booleanized Linear Functions and Linear Threshold Functions 
in Finite Fields 

Suppose $p$ is a prime number and $n$ a positive integer. Given an arbitrary function 
that maps inputs from $GF_p$ to boolean values, 

$$\phi : GF_p \to \{-1, +1\}$$ 

we define a class of booleanized linear functions as a collection of boolean 
functions: 

$$\mathcal{F}_\phi = \{ f_{a,\phi}(x) := \phi(a \cdot x) \mid a \in GF_p^n \},$$ 

and we call the function $\phi$ the booleanizer. 

Booleanized linear functions can be viewed as natural extensions of parity 
functions (which are linear functions over $GF_2^n$). 

If the booleanizer function $\phi$ is a threshold function: 

$$\phi_b(x) = \begin{cases} 1, & \text{if } x \ge b \\ -1, & \text{if } x < b \end{cases}$$ 

we call the corresponding class of booleanized linear functions linear threshold 
functions, and denote the functions by $f_{a,b}$. 
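
A small sketch of a linear threshold function over $GF_p$ and of the correlation between two members of the class. It assumes the threshold comparison is "greater than or equal to", uses the threshold $(p + 1)/2$ mentioned in the introduction, and takes the uniform distribution over $GF_5^2$; the specific vectors are hypothetical.

```python
from itertools import product

def linear_threshold(a, b, p):
    """Return f_{a,b}: +1 if (a . x mod p) >= b, else -1, for x over GF_p."""
    def f(x):
        return 1 if sum(ai * xi for ai, xi in zip(a, x)) % p >= b else -1
    return f

p, n = 5, 2
b = (p + 1) // 2
domain = list(product(range(p), repeat=n))

f1 = linear_threshold((1, 0), b, p)
f2 = linear_threshold((0, 1), b, p)

# Correlation under the uniform distribution: agreement minus disagreement.
correlation = sum(f1(x) * f2(x) for x in domain) / len(domain)
bias_f1 = sum(f1(x) for x in domain) / len(domain)
```

Here the two inner products $a \cdot x$ are independent and uniform, so the correlation equals the square of the bias of the booleanizer; a strongly biased booleanizer therefore yields a strongly correlated class.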



2.5 The Tensor Product and Statistical Distance 



Given two probabilistic distributions $D$ and $D'$ over spaces $\mathcal{X}$ and $\mathcal{X}'$, we define
their tensor product $D \otimes D'$ to be a new distribution over $\mathcal{X} \times \mathcal{X}'$:

$\Pr_{D \otimes D'}[(X, X') = (x, x')] = \Pr_D[X = x] \cdot \Pr_{D'}[X' = x']$



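Given a finite space $\mathcal{X}$ and distributions $D_1, D_2, \ldots, D_m$ over $\mathcal{X}$, we define the
all-pair $L_2$ statistical distance (abbreviated as SD) among $D_1, D_2, \ldots, D_m$ to be

$SD_2(D_1, D_2, \ldots, D_m) = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{x \in \mathcal{X}} \left( \Pr_{D_i}[X = x] - \Pr_{D_j}[X = x] \right)^2}$

Under this definition, it is easy to see that

$SD_2(D, D) = 0$

and

$SD_2(D_1, D_2, \ldots, D_m) = \sqrt{\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} SD_2(D_i, D_j)^2}$

for $m > 2$.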

One useful property of the all-pair L2 statistical distance is the sub-additivity: 



Lemma 2.2. Let $D_1, D_2, \ldots, D_m$ be distributions over $\mathcal{X}$ and $D'_1, D'_2, \ldots, D'_m$ be
distributions over $\mathcal{X}'$. Then we have

$SD_2(D_1 \otimes D'_1, D_2 \otimes D'_2, \ldots, D_m \otimes D'_m) \le SD_2(D_1, D_2, \ldots, D_m) + SD_2(D'_1, D'_2, \ldots, D'_m)$

□
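The sub-additivity claim is easy to try out numerically. The sketch below assumes the all-pair $L_2$ statistical distance is the square root of the summed squared differences (the printed definition is damaged in the source, so this normalization is an assumption), and it represents a distribution as a dictionary from outcomes to probabilities. The helper names are hypothetical.

```python
import math
from itertools import product


def tensor(d1: dict, d2: dict) -> dict:
    """Tensor product: Pr[(x, y)] = Pr_d1[x] * Pr_d2[y]."""
    return {(x, y): px * py
            for (x, px), (y, py) in product(d1.items(), d2.items())}


def sd2(dists: list) -> float:
    """All-pair L2 statistical distance among a list of distributions.

    Assumed form: sqrt of the sum, over all ordered pairs (i, j) and all
    outcomes x, of (Pr_i[x] - Pr_j[x])^2.
    """
    support = set().union(*dists)
    total = 0.0
    for di in dists:
        for dj in dists:
            total += sum((di.get(x, 0.0) - dj.get(x, 0.0)) ** 2
                         for x in support)
    return math.sqrt(total)
```

With three distributions on each side, `sd2([tensor(d, e) for d, e in zip(ds, es)])` can be compared against `sd2(ds) + sd2(es)` to see the inequality of Lemma 2.2 on a concrete instance.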







Since each random variable naturally induces a distribution, we can also define
the all-pair $L_2$ statistical distance among random variables: for random variables
$X_1, X_2, \ldots, X_m$, their all-pair $L_2$ statistical distance is defined to be the
all-pair $L_2$ statistical distance among the distributions induced by them.
The sub-additivity property remains true: suppose we have random variables
$X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_m$, such that $X_i$ is independent of $Y_j$ for any pair
of $i, j \in \{1, 2, \ldots, m\}$. Then we have

$SD_2(X_1 Y_1, X_2 Y_2, \ldots, X_m Y_m) \le SD_2(X_1, X_2, \ldots, X_m) + SD_2(Y_1, Y_2, \ldots, Y_m)$

2.6 Chernoff Bounds 

We will be using Chernoff bounds in our paper, and our version is from [MR95].

Theorem 2.1. Let $X_1, X_2, \ldots, X_n$ be a sequence of $n$ independent $\{0, 1\}$ random
variables. Let $S$ be the sum of the random variables and $\mu = E[S]$. Then, for
$0 < \delta < 1$, the following inequalities hold:

$\Pr[S > (1 + \delta)\mu] <$ [right-hand side missing in the source]

and

$\Pr[S < (1 - \delta)\mu] <$ [right-hand side missing in the source]

□
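Since the right-hand sides of Theorem 2.1 did not survive in this copy, the sketch below uses one widely quoted textbook form of the Chernoff bounds, namely $e^{-\mu\delta^2/3}$ for the upper tail and $e^{-\mu\delta^2/2}$ for the lower tail (valid for $0 < \delta < 1$). This form is an assumption and may differ from the constants in the paper's cited version. The code compares these bounds with exact binomial tails for identically distributed bits; all names are hypothetical.

```python
import math


def binom_pmf(n: int, q: float, k: int) -> float:
    """Probability that a Binomial(n, q) variable equals k."""
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)


def upper_tail(n: int, q: float, delta: float) -> float:
    """Exact Pr[S > (1 + delta) * mu] for S ~ Binomial(n, q), mu = n*q."""
    mu = n * q
    return sum(binom_pmf(n, q, k) for k in range(n + 1) if k > (1 + delta) * mu)


def lower_tail(n: int, q: float, delta: float) -> float:
    """Exact Pr[S < (1 - delta) * mu] for S ~ Binomial(n, q), mu = n*q."""
    mu = n * q
    return sum(binom_pmf(n, q, k) for k in range(n + 1) if k < (1 - delta) * mu)


def chernoff_upper(mu: float, delta: float) -> float:
    """Textbook upper-tail bound exp(-mu * delta^2 / 3), 0 < delta < 1."""
    return math.exp(-mu * delta ** 2 / 3.0)


def chernoff_lower(mu: float, delta: float) -> float:
    """Textbook lower-tail bound exp(-mu * delta^2 / 2), 0 < delta < 1."""
    return math.exp(-mu * delta ** 2 / 2.0)
```

The exact tails are typically far smaller than the bounds; the bounds matter because they hold uniformly and decay exponentially in $\mu$.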



3 Statistical Query Model: Negative Results 

In this section we present a negative result characterizing the Statistical Query 
Model. 

Throughout this section, we use $\Omega$ to denote a finite set of size $M$ and we are
interested in functions mapping elements in $\Omega$ to $+1$ or $-1$.

3.1 Statistical Dimension and Fourier Analysis 

Definition 3.1. Let $\Omega$ be a finite set of size $M$, let $\mathcal{F}$ be a class of boolean
functions whose input domain is $\Omega$, and let $D$ be a distribution over $\Omega$. We define
SQ-DIM($\mathcal{F}$, $D$), the statistical query dimension of $\mathcal{F}$ with respect to $D$, to be
the largest natural number $d$ such that there exists a real number $\lambda$, satisfying
$0 \le \lambda < 1/2$, and that $\mathcal{F}$ contains $d$ functions $f_1, f_2, \ldots, f_d$ with the property that
for all $i \ne j$, we have

$|\langle f_i, f_j \rangle - \lambda| \le 1/d^3$

Notice the definition of SQ-DIM in [BFJ+94] can be regarded as the special
case where $\Omega = \{-1, +1\}^n$ with the restriction that $\lambda = 0$.

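Notice that although each of the functions $f_1, f_2, \ldots, f_d$ can be highly correlated with the
others, we view this correlation as a “false correlation”: as we will prove in the next
lemma, we can “extract” $d$ new functions $\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_d$ from $f_1, f_2, \ldots, f_d$, such
that the new functions are almost totally uncorrelated with each other.

Lemma 3.1. Let $\Omega$, $D$, $d$, $\lambda$, and $f_1, f_2, \ldots, f_d$ be as defined in Definition 3.1,
and $\lambda > 0$. We define $d$ real-valued functions $\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_d$:

[defining equation illegible in the source; the surviving fragments are the term $\sqrt{1 + (d-1)\lambda}$ and a sum over $j$]    (1)

Then we have

$|\langle \tilde{f}_i, \tilde{f}_i \rangle - 1| \le$ [bound illegible in the source], $\forall i$    (2)

and

$|\langle \tilde{f}_i, \tilde{f}_j \rangle| \le$ [bound illegible in the source]    (3)

□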



So we now get a group of functions $\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_d$ that are “almost” orthogonal.
However, these $d$ new functions are not necessarily boolean functions.
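One standard way to carry out such an “extraction” is symmetric (Löwdin) orthogonalization, which multiplies the functions by the inverse square root of their Gram matrix. The sketch below is an illustration only and is not claimed to be the paper's equation (1): it assumes the idealized case where every pairwise inner product is exactly $\lambda$, works at the level of the Gram matrix, and uses hypothetical names throughout.

```python
import math


def gram_matrix(d: int, lam: float):
    """Gram matrix with 1 on the diagonal and lam everywhere else."""
    return [[1.0 if i == j else lam for j in range(d)] for i in range(d)]


def decorrelating_coefficients(d: int, lam: float):
    """Coefficients C = G^(-1/2) = alpha*I - beta*J for the matrix above.

    G has eigenvalue 1 - lam on vectors orthogonal to the all-ones vector
    and eigenvalue 1 + (d - 1)*lam on the all-ones vector.
    """
    alpha = 1.0 / math.sqrt(1.0 - lam)
    beta = (alpha - 1.0 / math.sqrt(1.0 + (d - 1) * lam)) / d
    return [[(alpha if i == j else 0.0) - beta for j in range(d)]
            for i in range(d)]


def matmul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transformed_gram(d: int, lam: float):
    """Gram matrix of the new functions sum_j C[i][j] * f_j, i.e. C G C^T."""
    c = decorrelating_coefficients(d, lam)
    c_t = [list(row) for row in zip(*c)]
    return matmul(matmul(c, gram_matrix(d, lam)), c_t)
```

In this idealized setting `transformed_gram(d, lam)` is the identity matrix up to rounding, so the new functions are exactly orthonormal; with inner products that are only within $1/d^3$ of $\lambda$, one would expect small deviations of the kind bounded in (2) and (3).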

Next, we can extend this group of functions to a basis and perform Fourier
analysis on the basis. This part of the analysis is very similar to the proofs in
[BFJ+94], but with different parameters and (sometimes) improved bounds.
The detailed analysis is in the appendix.



3.2 Approximating a Function without a Query

We give an upper bound on the advantage an algorithm $A$ can have in approximating
a class of functions, if $A$ doesn’t make any queries.



Theorem 3.1. Let $\Omega$ be a finite set of size $M$, and let $D$ be a probabilistic distribution
over $\Omega$. Let $\mathcal{F}$ be a class of boolean functions $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$, such that
$|\langle f_i, f_j \rangle - \lambda| \le 1/d^3$ for all pairs $i \ne j$, where $\lambda > 0$. Let $g : \Omega \to [-1, +1]$ be the
characteristic function of an algorithm $A$ such that $\langle g, f_i \rangle \ge T$ for $i = 1, 2, \ldots, d$.
Then we have

[upper bound on $T$ missing in the source]

for $d > 100$. □

We next show that this bound is “almost tight”, i.e., we give an example
where $T = \sqrt{\lambda}$.



Theorem 3.2. For any odd prime $p$ and any integer $n \ge 2$, there exists a class
of $d = p^{n-1}$ boolean functions over $GF_p^n$: $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$, and a distribution
$D$, such that any pair of the functions has identical inner product $\lambda$ and the
inner product of the constant function $g(x) = 1$ and any $f_i$ is $\langle g, f_i \rangle_D = \sqrt{\lambda}$.
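A small brute-force check illustrates how such a class can arise. The sketch assumes the class is the set of linear threshold functions over $GF_p^2$ whose last coefficient is 1, taken under the uniform distribution; this choice is an assumption made for illustration (it mirrors the class used later in Theorem 3.5), and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def inner(f, g, p: int, n: int) -> Fraction:
    """Exact inner product E[f(x) g(x)] under the uniform distribution."""
    total = sum(f(x) * g(x) for x in product(range(p), repeat=n))
    return Fraction(total, p ** n)


def make_threshold_function(a, b: int, p: int):
    """f_{a,b}(x) = +1 if (a . x mod p) >= b, else -1."""
    return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) % p >= b else -1


def check_example(p: int, b: int):
    """Check pairwise inner products and the constant function for n = 2."""
    n = 2
    fs = [make_threshold_function((a1, 1), b, p) for a1 in range(p)]
    phi_bias = Fraction(p - 2 * b, p)  # bias of the threshold booleanizer
    pairs_ok = all(inner(fs[i], fs[j], p, n) == phi_bias ** 2
                   for i in range(p) for j in range(i + 1, p))
    const_ok = all(inner(lambda x: 1, f, p, n) == phi_bias for f in fs)
    return pairs_ok, const_ok
```

For two distinct coefficient vectors with last entry 1, the pair of inner products with a uniform input is uniform over $GF_p^2$, so every pairwise inner product equals the squared bias; when the bias is positive, the constant function +1 then achieves exactly the square root of that common value.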
3.3 A Lower Bound for Statistical Query Learning

We have proved that without making any queries, a learning algorithm cannot
learn a function family with advantage more than the bound of Theorem 3.1. Next we show
that in order to improve the advantage, a lot of (normally exponentially many)
queries have to be made. More precisely, we have the following theorem:



Theorem 3.3. Let $\Omega$ be a finite set of size $M$, and let $D$ be a probabilistic distribution
over $\Omega$. Let $\mathcal{F}$ be a class of boolean functions $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$, such that
$|\langle f_i, f_j \rangle - \lambda| \le 1/d^3$ for all pairs $i \ne j$, where $1/2 > \lambda > 0$ and $d > 100$. Let $A$ be
an algorithm that makes $Q$ queries to an honest SQ-oracle, each of which has
sample count at most $N$, and learns $\mathcal{F}$ with advantage $S +$ [additive term missing in the source], where
$S \ge d^{-1/4}$. Then we have

$NQ \ge \frac{\sqrt{d} \cdot S}{2}$

□



We comment that the total running time of $A$ is bounded from below by $NQ$, since $N$ is
written in unary. Thus the running time of $A$ is also bounded from
below by $\sqrt{d} \cdot S/2$. This theorem gives a tradeoff between the running time of
$A$ and the “extra” advantage it can have in learning $\mathcal{F}$: the running time goes
up linearly with the advantage, and in particular, to get a constant advantage, a
running time of $\Omega(d^{1/2})$ is needed.

Proof. We assume $A$ is a Turing Machine. Suppose the target function is $f_j$. We
define the state of $A$ after the $k$-th query to be the binary string $S_{A,k}^{j}$ that describes
the contents of $A$’s tapes, the positions of the heads, and the current internal state of
$A$. We define $S_{A,0}^{j}$ to be the state of $A$ before $A$ starts. Notice each $S_{A,k}^{j}$ is a random
variable: the randomness comes from both the honest SQ-oracle and the random
coins $A$ tosses.

In the rest of the proof, we will omit the subscript $A$ if there is no danger of
confusion.

We define $\Delta_k$ to be the all-pair $L_2$ statistical distance among $S_k^1, S_k^2, \ldots, S_k^d$:

$\Delta_k = SD_2(S_k^1, S_k^2, \ldots, S_k^d)$

Intuitively, $\Delta_k$ measures how “differently” $A$ behaves when it has different
target functions as inputs from the oracle.

We shall prove the following lemmas (in the appendix) concerning the $\Delta_k$’s:

Lemma 3.2. $\Delta_0 = 0$.

This is obvious since A hasn’t made any queries yet, and the state of A is 
independent of the target function. 



Lemma 3.3. $\Delta_{k+1} - \Delta_k \le N \cdot \sqrt{d}$

□

Next we show that in order to learn $\mathcal{F}$ with large advantage, the all-pair
statistical distance has to be large.

Lemma 3.4. If $A$ learns $\mathcal{F}$ with advantage $S +$ [additive term missing in the source], where $S \ge d^{-1/4}$,
then $\Delta_Q \ge dS/2$. □



Now putting Lemma 3.2, Lemma 3.3 and Lemma 3.4 together, we have $NQ\sqrt{d} \ge dS/2$,
or

$NQ \ge \frac{\sqrt{d} \cdot S}{2}$
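To get a feel for the size of this bound, the following sketch evaluates $\sqrt{d} \cdot S / 2$ and the implied number of queries for a given sample count. It only restates the arithmetic of Theorem 3.3; the function names are hypothetical.

```python
import math


def min_total_samples(d: int, extra_advantage: float) -> float:
    """Lower bound sqrt(d) * S / 2 on N * Q, as stated in Theorem 3.3."""
    if extra_advantage < d ** -0.25:
        raise ValueError("Theorem 3.3 assumes S >= d^(-1/4)")
    return math.sqrt(d) * extra_advantage / 2.0


def min_queries(d: int, extra_advantage: float, sample_count: int) -> float:
    """Implied lower bound on Q when each query uses at most N samples."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return min_total_samples(d, extra_advantage) / sample_count
```

For a class of booleanized linear functions with $d = p^{n-1}$, the bound grows like $p^{(n-1)/2}$ for any constant extra advantage, which is the exponential behaviour discussed in the text.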



As a comparison, implicit in [J00] is the following theorem:

Theorem 3.4 (Implicit in [J00]). Let $\Omega$ be a finite set of size $M$, and let $D$ be
a probabilistic distribution over $\Omega$. Let $\mathcal{F}$ be a class of boolean functions $\mathcal{F} =
\{f_1, f_2, \ldots, f_d\}$, such that $\langle f_i, f_j \rangle = 0$ for all pairs $i \ne j$. Let $A$ be an algorithm
that makes $Q$ queries to an honest SQ-oracle, each of which has sample count at
most $N$, and learns $\mathcal{F}$ with advantage $S$. Then we have

$NQ = \Omega(d - 1/S^2)$




Proof sketch: Notice this is the case where all target functions are completely
orthogonal and Fourier analysis works perfectly: one can extend $\mathcal{F}$ to an
orthonormal basis directly. Suppose $A$ has an advantage of $S$. Then the characteristic
function $\psi_A$ has a coefficient of at least $S$ for the target function. However,
$\psi_A$ can have at most $1/S^2$ coefficients that are larger than or equal to $S$,
by Parseval’s equality. One can simply query $1/S^2$ more times to completely
determine the target function and have an advantage of 1. But as proved in
[J00], $\Omega(d)$ queries are needed to learn $\mathcal{F}$ with advantage 1. Therefore we have
$NQ = \Omega(d - 1/S^2)$. □



The bound in [J00] is a bound for the specific case that all functions are
orthogonal to each other, and in this case, it is better than the bound given
in this paper. Our bound is weaker, but it works for a more general class of
functions.



3.4 Hardness for Learning Booleanized Linear Functions over a
Finite Field

Next we show that the class of booleanized linear functions cannot be learned 
efficiently using statistical query: 

Theorem 3.5. Let $p$ be an odd prime and $n > 1$ an integer. Let $\phi : GF_p \to
\{-1, +1\}$ be a booleanizer whose bias $\langle \phi \rangle$ lies strictly between a negative lower
bound and an upper bound (both bounds are illegible in the source), and let $\mathcal{F}$ be a class of
booleanized linear functions:

$\mathcal{F} = \{ f_{a,\phi} \mid a_n = 1 \}$

Let $D$ be the uniform distribution over $GF_p^n$. Then any algorithm with access
to an honest SQ-oracle that learns $\mathcal{F}$ with advantage $|\langle \phi \rangle| + S + 1/p^{n-1}$, where $S \ge
p^{-(n-1)/4}$, with respect to distribution $D$, has a running time of at least
$p^{(n-1)/2} \cdot S/2$. □

Notice that it is not hard to prove that each function $f_{a,\phi}$ has bias $\langle \phi \rangle$. If
$\langle \phi \rangle > 0$, then the constant function $g(x) = +1$ already has an advantage $\langle \phi \rangle$ in
approximating $\mathcal{F}$; otherwise $g(x) = -1$ has an advantage $|\langle \phi \rangle|$ in approximating
$\mathcal{F}$.

This result gives some positive evidence towards the security of the private-key
cryptosystem proposed by Baird [B01].

Since linear threshold functions are special cases of booleanized linear functions,
we have the following theorem:

Theorem 3.6. Let $p$ be an odd prime and $n > 1$ an integer. Let $b$ be a non-zero
element in $GF_p$ such that $(p-1)/4 < b < 3(p-1)/4$, and let $\mathcal{F}$ be a class of
linear threshold functions:

$\mathcal{F} = \{ f_{a,b} \mid a_n = 1 \}$

Let $D$ be the uniform distribution over $GF_p^n$. Then any algorithm with access
to an honest SQ-oracle that learns $\mathcal{F}$ with advantage $|(p - 2b)/p| + S + 1/p^{n-1}$,
where $S \ge p^{-(n-1)/4}$, with respect to distribution $D$, has a running time of at least
$p^{(n-1)/2} \cdot S/2$.

Furthermore, we have: 

Corollary 3.1. For the class of linear threshold functions, in the case that $p$ is
exponentially large in $n$ and $b = (p+1)/2$, no statistical query algorithm can
weakly learn $\mathcal{F}$.

Proof. When $b = (p+1)/2$, we have $(p - 2b)/p = -1/p$, which is exponentially
small in $n$. If an algorithm $A$ weakly learns $\mathcal{F}$, it has to have an advantage $\epsilon \ge 1/n^c$
for some constant $c$. Then by Theorem 3.6 the running time of $A$ has to be at
least $\frac{p^{(n-1)/2}}{2} \cdot (\epsilon - 1/p - 1/p^{n-1})$, which is exponentially large in $n$.

4 Algorithm for Linear Threshold Functions 

In this section we present an algorithm BUILD-TREE that learns a special class
of linear threshold functions as shown in Corollary 3.1 using a random example
oracle. The running time of BUILD-TREE is slightly better than the brute-force
algorithm, and also slightly better than the lower bound for the statistical query
model.

We first state the problem: pick an integer $n > 1$ and an odd prime $p$ such
that $p$ is exponentially large in $n$. Let $b = (p+1)/2$, and the class of functions
is the class of linear threshold functions with the fixed $b$ and with the constraint
that the $n$-th entry of $a$ is 1:

$\mathcal{F} = \{ f_a \mid a_n = 1 \}$

The distribution over the inputs is the uniform distribution. We show an algorithm
that learns any function $f \in \mathcal{F}$ in time [running time missing in the source] with advantage 0.5,
with respect to a random example oracle. Notice the brute-force algorithm that
examines all possible functions has running time [missing in the source], and any SQ-algorithm
must also have a running time [missing in the source] to have a constant advantage in learning
$\mathcal{F}$.



4.1 Description of the Algorithm 

The idea for BUILD-TREE is pretty intuitive: given a target function $f_a$, we know
there is a “secret vector” $a$ associated with the function. If one picks a random
negative example $x$, then the expected value of $a \cdot x$ is $(p-1)/4$. If we draw
$(4q+1)$ independent random negative samples, the expected sum of the inner
products is about $(4q+1)(p-1)/4$, which is about $(p-1)/4$ modulo $p$, if $q \ll p$
(in our algorithm, we have $q = O(\log n) = O(\log\log p)$). So it is more likely that
the sum of $(4q+1)$ random negative examples is still a negative example than
that it is a positive one. The algorithm exploits this “marginal difference”, boosts it by
the Chernoff bound, and gains a constant advantage in learning $\mathcal{F}$. What BUILD-TREE
does is: it first draws negative examples, and when getting an
input $X$, it tries to write $X$ as the sum of $(4q+1)$ negative examples it drew, and
estimates the success probability. If the success probability is high, it outputs
“$f(X) = -1$”, otherwise it outputs “$f(X) = +1$”. The name of the algorithm
comes from the fact that the algorithm estimates the probability by building a
complete binary tree from the samples it draws.

Our algorithm is inspired by the algorithm Blum et al. used in [BKW00] to
learn noisy parity functions, where the main idea is also trying to write an input
as the sum of logarithmically many samples.
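The “marginal difference” can be computed exactly for small parameters. The sketch below makes one simplifying assumption: it tracks only the inner products $a \cdot x$, treating each negative example as an independent residue uniform on $\{0, \ldots, b-1\}$ with $b = (p+1)/2$. The function name is hypothetical.

```python
def prob_sum_is_negative(p: int, q: int) -> float:
    """Exact Pr[(r_1 + ... + r_m) mod p < b] with m = 4q + 1.

    Each r_i is uniform on {0, ..., b-1}, b = (p + 1) // 2, which models the
    inner product a . x of a random negative example. The distribution of
    the sum modulo p is obtained by repeated convolution on integer counts.
    """
    b = (p + 1) // 2
    m = 4 * q + 1
    counts = [0] * p
    counts[0] = 1  # empty sum is 0
    for _ in range(m):
        new_counts = [0] * p
        for s, c in enumerate(counts):
            if c:
                for r in range(b):
                    new_counts[(s + r) % p] += c
        counts = new_counts
    # Fraction of the b**m equally likely tuples whose sum lands below b.
    return sum(counts[:b]) / b ** m
```

Calling `prob_sum_is_negative(101, 1)` gives the chance that five negative residues modulo 101 again sum to a negative residue; a value above one half is the gap the algorithm amplifies.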

Now we describe BUILD-TREE in more detail: 

The algorithm BUILD-TREE has a random example oracle $EX_f$, which, at
each invocation, produces a random pair $(x, f_a(x))$, where $x$ is uniformly chosen
from $GF_p^n$. The algorithm also has an input $X$ on which it tries to predict $f_a(X)$.

The algorithm consists of 2 phases. In Phase I, it draws about [number missing in the source]
samples and processes them; in Phase II, it reads the input $X$ and tries to
build a complete binary tree from the samples it drew in Phase I, where each
node is a multi-set of elements in $GF_p^n$; finally, BUILD-TREE counts the number
of elements in the root node and uses this number to predict $f_a(X)$. In the
description that follows, we identify “groups” with “multisets”.

In the rest of this section, we omit $b$ in the description of $f_{a,b}$ since $b$ is fixed to be
$(p+1)/2$.

- Phase I: We define $a = \log n / 2$ and $b = 2n / \log n$, and think of each vector
in $GF_p^n$ as divided into $a$ blocks, each block containing $b$ elements in $GF_p$.
We define

$K =$ [definition illegible in the source; it ends in a factor $\cdot\, n$]

BUILD-TREE draws $2^{a-1}(2^a + 1)K$ negative samples. Notice each $f_a$ is “reasonably
balanced” and thus there would be no trouble getting enough negative
samples. We use $N$ to denote $2^{a-1}K$. BUILD-TREE groups these samples
into $2^a + 1$ groups of $N$ elements each, and denotes these groups by
$G_0, G_1, \ldots, G_{2^a}$. Then it adds the last 2 groups $G_{2^a - 1}$ and $G_{2^a}$ entry-wise to
form a new group, $G'_{2^a - 1}$. More precisely, suppose $G_{2^a - 1} = \{a_1, a_2, \ldots, a_N\}$
and $G_{2^a} = \{b_1, b_2, \ldots, b_N\}$, then

$G'_{2^a - 1} = \{a_1 + b_1, a_2 + b_2, \ldots, a_N + b_N\}$

is also a group of $N$ elements. Now define $G'_i = G_i$ for $i = 0, 1, \ldots, 2^a - 2$,
and now we have $2^a$ groups $G'_0, G'_1, \ldots, G'_{2^a - 1}$ of $N$ elements.

- Phase II: In this phase BUILD-TREE gets a new sample $X$ and it tries
to learn $f_a(X)$. The approach is to try to write $X$ as the sum of $2^a + 1$
negative samples drawn in Phase I. More precisely, BUILD-TREE tries to
find $2^a$ elements $X_0, X_1, \ldots, X_{2^a - 1}$, such that $X_i \in G'_i$ for $i = 0, 1, \ldots, 2^a - 1$
and

$X = X_0 + X_1 + \cdots + X_{2^a - 1}$.

Notice $X_{2^a - 1} \in G'_{2^a - 1}$ is already a sum of 2 negative samples, and thus if
one can find such $2^a$ elements, $X$ is the sum of $2^a + 1$ negative samples.
Since BUILD-TREE is working in $GF_p$, it can compute $Y = 2^{-a} X$, and subtract
$Y$ from each element in each group $G'_i$. More precisely, we define

$A_i = \{ x - Y \mid x \in G'_i \}$

for $i = 0, 1, \ldots, 2^a - 1$. Then the task for BUILD-TREE becomes finding $2^a$
elements, one from each $A_i$, such that they add up to 0.

To do so, BUILD-TREE will build a complete binary tree of multi-sets. First 
some notations: We define the height of a node in a binary tree as the shortest 
distance from this node to a leaf node, and a leaf node has height 0. The 
height of a binary tree is the height of its root. A node that is neither a leaf 
node nor the root node is called an internal node. There are (2“ — 2) internal 
nodes for a complete binary tree of height a. 

Here is the actual construction: 

BUILD-TREE will build a complete binary tree of height $a$, and there are
$2^{a-k}$ nodes of height $k$: we will denote these nodes by $G_0^k, G_1^k, \ldots, G_{2^{a-k} - 1}^k$.
The construction is bottom-up: one builds the nodes of height 0, or the
leaf nodes, first, and then the nodes of height $1, 2, \ldots, a - 1$, and finally the
root node. An invariant that BUILD-TREE maintains is: all non-root nodes
of height $l$ contain $2^{a-l-1}K$ elements, all of which have 0’s in the first $l$
blocks.

• LEAF NODES:

The leaf nodes are just the sets $A_0, A_1, \ldots, A_{2^a - 1}$. In other words, let
$G_i^0 = A_i$ for $i = 0, 1, \ldots, 2^a - 1$.

• INTERNAL NODES:

After all the nodes of height $(l - 1)$ are built, BUILD-TREE constructs
the nodes of height $l$.

To construct node $G_j^l$, BUILD-TREE needs nodes $G_{2j}^{l-1}$ and $G_{2j+1}^{l-1}$,
namely, the two children of $G_j^l$. BUILD-TREE does the following:

It starts by setting $G_j^l$ to be the empty set and labels all elements in $G_{2j}^{l-1}$
and $G_{2j+1}^{l-1}$ as “unmarked”.

It repeats the following “SELECT-AND-MARK” process $2^{a-l-1}K$
times:

BEGIN OF SELECT-AND-MARK

* BUILD-TREE (arbitrarily) picks an unmarked element $u \in G_{2j}^{l-1}$, and
scans $G_{2j+1}^{l-1}$ to check if there is an unmarked element $v \in G_{2j+1}^{l-1}$, such
that $u + v$ has the first $l$ blocks all-zero. Notice that both $u$ and $v$
have the first $l - 1$ blocks all-zero already, and thus BUILD-TREE is
actually looking for a $v$ whose $l$-th block is the complement of that
of $u$.

* If BUILD-TREE finds such a $v$, it puts $u + v$ into $G_j^l$ and labels both
$u$ and $v$ as “marked”.

* If BUILD-TREE can’t find such a $v$, it aborts: the algorithm fails.

END OF SELECT-AND-MARK

If BUILD-TREE doesn’t abort in the $2^{a-l-1}K$ SELECT-AND-MARK
processes, it constructs a set $G_j^l$ of size $2^{a-l-1}K$.

• ROOT NODE:

If BUILD-TREE doesn’t abort in constructing the $(2^a - 2)$ internal nodes,
it proceeds to build the root node, $G_0^a$. Notice the children of node $G_0^a$
are nodes $G_0^{a-1}$ and $G_1^{a-1}$, each of which contains $K$ elements: suppose
that

$G_0^{a-1} = \{u_1, u_2, \ldots, u_K\}$

and

$G_1^{a-1} = \{v_1, v_2, \ldots, v_K\}$

Then the root node $G_0^a$ is

$G_0^a = \{ u_i + v_i \mid u_i + v_i = 0,\ i = 1, 2, \ldots, K \}$

In other words, $G_0^a$ is a multi-set of 0’s, and the size of $G_0^a$ depends on
the number of corresponding pairs of vectors in $G_0^{a-1}$ and $G_1^{a-1}$ that are
complements of each other.

In this way BUILD-TREE builds a complete binary tree all the way up to
the root. If the size of the root node is greater than $2^{(\cdot)} \cdot n$ (the exponent is
illegible in the source), BUILD-TREE
outputs “$f(X) = -1$”; otherwise it outputs “$f(X) = +1$”.
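The three building blocks of Phase II (shifting by $Y$, merging two children, counting the root) can be sketched as follows. This is a hedged illustration rather than the author's implementation: vectors are tuples of integers modulo p, the linear scan for a matching $v$ is replaced by a dictionary lookup, the “arbitrary” choice of $u$ is simply list order, and every function name is hypothetical.

```python
from collections import defaultdict


def shift_groups(groups, x, p):
    """Subtract Y = X / len(groups) (mod p) from every element of every group.

    len(groups) plays the role of 2^a; its inverse exists because p is odd.
    pow(..., -1, p) needs Python 3.8 or later.
    """
    inv = pow(len(groups), -1, p)
    y = tuple(inv * xi % p for xi in x)
    return [[tuple((vi - yi) % p for vi, yi in zip(v, y)) for v in g]
            for g in groups]


def merge_children(left, right, level, block_len, p, quota):
    """Build a node of height `level` from its two children.

    Runs `quota` SELECT-AND-MARK rounds; returns None if a round fails.
    """
    lo, hi = (level - 1) * block_len, level * block_len
    unmarked = defaultdict(list)  # unmarked right elements keyed by their block
    for v in right:
        unmarked[v[lo:hi]].append(v)
    node = []
    for u in left[:quota]:
        needed = tuple(-ui % p for ui in u[lo:hi])  # complement of u's block
        if not unmarked[needed]:
            return None  # no partner for u: the algorithm aborts
        v = unmarked[needed].pop()  # popping v marks it as used
        node.append(tuple((ui + vi) % p for ui, vi in zip(u, v)))
    return node if len(node) == quota else None


def root_size(left, right, p):
    """Count index-aligned pairs (u_i, v_i) whose sum is the zero vector."""
    return sum(all((ui + vi) % p == 0 for ui, vi in zip(u, v))
               for u, v in zip(left, right))
```

The dictionary lookup changes only the cost of finding a partner, not which sums can be formed; the final prediction would compare `root_size` against the threshold described in the text.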
4.2 Analysis of the BUILD-TREE Algorithm

The detailed analysis is in the appendix, and we state the theorem here:

Theorem 4.1. With probability at least 0.8, the BUILD-TREE algorithm learns
$\mathcal{F}$ with accuracy $1 -$ [term missing in the source]. In other words, the BUILD-TREE algorithm learns
$\mathcal{F}$ with advantage 0.5, and has a running time [missing in the source]. □

It is interesting to compare BUILD-TREE with the algorithm used in [BKW00]
to learn parity functions in the presence of noise, which also draws many samples,
views each sample as blocks, and tries to write an input as the sum of $O(\log n)$
samples; both algorithms have a similar sub-exponential bound. However,
there are differences: in [BKW00], the algorithm draws samples with labels, and
it writes an input as the sum of $O(\log n)$ samples to fight the noise (if there
were no noise, it would be easy to learn the function by Gaussian elimination); in this paper,
BUILD-TREE only draws negative examples, and it writes an input as the sum
of $O(\log n)$ negative samples to create a probabilistic gap, since there is no noise in
the problem. Furthermore, the algorithm in [BKW00] is satisfied with just
finding a way to write an input as a sum of $O(\log n)$ samples, while BUILD-TREE
has to estimate the probability that an input can be written as a sum of $O(\log n)$
samples, and thus is more complicated in this sense.

Notice that, using the same “padding” technique as in [BKW00], we can make
BUILD-TREE a polynomial-time algorithm: one simply pads [number missing in the source] zeros to
the input of BUILD-TREE, and then BUILD-TREE’s running time becomes
polynomial in the input length. However, still no polynomial-time algorithm
can learn this class of linear threshold functions in the statistical query model. This
gives an example of a PAC-learnable, but not SQ-learnable class of functions.
Previously, both [K98] and [BFJ+94] proved that the class of parity functions
fits into this category, and later [BKW00] proved that a class of noisy parity
functions also fits. The class of linear threshold functions over a finite field is the first
class of functions in this category that are not parity functions. We hope this
result can provide further insights into SQ-learning algorithms.

5 Conclusions and Open Problems 

In this paper, we discussed the problem of learning (highly) correlated functions
in the statistical query model. We showed an almost-tight upper bound
on the advantage an algorithm can have in approximating a class of functions
simultaneously. We also showed that any SQ algorithm trying to get a better
advantage in learning the class of functions has to make a lot of queries. A
consequence of our result is that the class of booleanized linear functions over finite
fields, which includes linear threshold functions, is not SQ-learnable. Finally we
demonstrated a PAC learning algorithm that learns a class of linear threshold
functions with constant advantage and a running time that is provably impossible
for SQ-algorithms. With proper padding, our algorithm can be made to run in
polynomial time, thus putting linear threshold functions into the category of
PAC-learnable, but not SQ-learnable functions; they are the first class in
this category that are not parity functions.

The technique we used in this paper to prove the lower bound is to keep
track of the “all-pair statistical distance” between scenarios when the algorithm
is given different target functions. Apparently, our technique is similar to the
one used in [A00], where the author proved a lower bound on the number of quantum queries a
quantum search algorithm has to make, but in a different setting. Their technique
is to keep track of the sum of the absolute values of the off-diagonal entries in
the system’s density matrix; we denote this quantity by $S$. Roughly speaking,
the author in [A00] proved that:

1. Before the algorithm makes any quantum queries, $S$ is large.

2. After the algorithm finishes all the queries, $S$ is small.

3. Each quantum query only decreases $S$ by a small amount.

And then they conclude that a lot of quantum queries are needed. It would be
interesting to investigate if there is a deeper relationship between the 2 techniques.

People already understand SQ-learning of uncorrelated functions: both lower
bounds and upper bounds on the number of queries are shown, and the two
bounds match. Our paper gives a lower bound for SQ-learning a class of functions
that are correlated the same way, but no matching upper bound is known. Even
less is known for the case that all the functions are correlated, but not in the
same way. In general, given $d$ functions $f_1, f_2, \ldots, f_d$ and their pair-wise
correlation $\langle f_i, f_j \rangle$ for all $i \ne j$, can we find a good lower bound for the number of
queries needed to learn these $d$ functions well? Is there an (even non-uniform)
matching upper bound?

Another interesting problem is: do there exist efficient algorithms to learn
booleanized linear functions over finite fields? Parity functions over $GF_2$
are easy to learn when there is no noise, and hard if there is noise: the
state of the art is Blum et al.’s algorithm [BKW00], which takes time $2^{O(n/\log n)}$
for $n$-bit parity functions with respect to uniform noise of constant rate, and
Goldreich-Levin-Jackson’s algorithm [GL89], [J00], which takes time $O(2^{n/2})$ for
$n$-bit parity functions, with respect to uniform noise of rate $(1/2 - 1/\mathrm{poly}(n))$
and some classes of malicious noise. However, in the case of finite fields of large
characteristic, it seems hard to learn the booleanized linear functions even
without noise. Notice an efficient learning algorithm will break Baird’s “blind
computation” cryptosystem, and a hardness result will automatically translate
to a security proof for Baird’s system.

Another interesting topic is learning functions in finite fields in general:
instead of limiting the outputs of functions to be boolean, we can allow functions
to output elements in a finite field, or some other large domains. What kind of
functions are learnable?

Abstract. We improve the analysis of the decision tree boosting algorithm
proposed by Mansour and McAllester. For binary classification
problems, the algorithm of Mansour and McAllester constructs a multi-way
branching decision tree using a set of multi-class hypotheses. Mansour
and McAllester proved that it works under certain conditions. We
give a much simpler analysis of the algorithm and simplify the conditions.
From this simplification, we can provide a simpler algorithm, for
which no prior knowledge on the quality of weak hypotheses is necessary.



1 Introduction 

Boosting is a technique to construct a “strong” hypothesis by combining many
“weak” hypotheses. This technique was first proposed by Schapire [9], originally
to prove the equivalence between strong and weak learnability in PAC-learning.
Many researchers have since improved boosting techniques, for example AdaBoost.
(See, for example, [4,3,10,11].) Among them, Kearns and Mansour [6]
showed that the learning process of well-known decision tree learning algorithms
such as CART [2] and C4.5 [8] can be regarded as boosting, thereby giving some
theoretical justification to those popular decision tree learning tools.

More precisely, Kearns and Mansour formalized the process of constructing
a decision tree as the following boosting algorithm. For any binary classification
problem, let $H_2$ be a set of binary hypotheses for this classification problem.
Starting from the trivial single-leaf decision tree, the learning algorithm improves
the tree by replacing some leaf of the tree (chosen according to a certain
rule) with an internal node that corresponds to a hypothesis $h \in H_2$ (again
chosen according to a certain rule). It is shown that the algorithm outputs a
tree $T$ with its training error below $s^{-\gamma}$, where $s$ is the number of leaves of $T$,
provided that for any distribution, there always exists some hypothesis in $H_2$
whose “advantage” is larger than $\gamma$ ($0 < \gamma < 1$) for the classification problem.
This implies that $(1/\epsilon)^{1/\gamma}$ steps are sufficient for the desired training error $\epsilon$.
(See the next section for the details; in particular, the definition of “advantage”.)
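The step count quoted above follows from solving $s^{-\gamma} \le \epsilon$ for $s$. The sketch below assumes the training error bound has the form $s^{-\gamma}$, which is the form consistent with the step count $(1/\epsilon)^{1/\gamma}$; the function name is hypothetical.

```python
def leaves_sufficient(eps: float, gamma: float) -> float:
    """Tree size s at which the assumed bound s**(-gamma) drops to eps.

    Solving s**(-gamma) <= eps gives s >= (1 / eps) ** (1 / gamma).
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    if not (0.0 < gamma <= 1.0):
        raise ValueError("gamma must lie in (0, 1]")
    return (1.0 / eps) ** (1.0 / gamma)
```

The dependence on $\gamma$ is steep: halving the advantage squares the number of leaves needed for the same target error.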

There are two extensions of the result of Kearns and Mansour. Takimoto
and Maruoka generalized the algorithm for multi-class learning [12]. Their algorithm
uses, for any fixed $K \ge 2$, $K$-class hypotheses, i.e., hypotheses providing

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 77-|9^ 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001
$K$ branches. On the other hand, Mansour and McAllester gave a generalized
algorithm that constructs a decision tree (for binary classification) by using
multi-class hypotheses that provide at most $K$ branches [7]. That is, their algorithm
may construct a decision tree having nodes with different numbers of
branches.

In this paper, we improve the analysis of Mansour and McAllester’s algorithm.

Consider the situation of constructing a decision tree of size $s$ for a given
$s$ by using multi-class hypotheses such that the sizes of the ranges are bounded
by some constant $K \ge 2$. Mansour and McAllester showed that their algorithm
produces a size $s$ decision tree with training error bound $s^{-\gamma}$ under the following
condition.

The condition of Mansour and McAllester

At each boosting step, there always exists a hypothesis $h$ satisfying the
following:

(1) $h$ is either binary (in which case $k = 2$) or $k$-class with some $k$, $2 <
k \le K$, such that $k \le (s/s')(\gamma e_\gamma(k)/2)$, where $s'$ is the current decision
tree size, and

(2) $h$ has advantage larger than $\gamma g_\gamma(k)$.

Here $g_\gamma$ and $e_\gamma$ are defined by

[definitions illegible in the source: $g_\gamma(k)$ is a multiple of $\ln k$, and $e_\gamma(k)$ is an expression indexed from $i = 1$]

This result intuitively means that if we can assume, at each boosting step, some
hypothesis that is better than a binary hypothesis with advantage $\gamma$, then the
algorithm produces a tree that is as good as the one produced by the original
boosting algorithm using only binary hypotheses with advantage $\gamma$. (Note that
$g_\gamma(2) = 1$ by definition.)

We simplify their analysis, thereby obtaining the following improved con- 
dition, which also makes the algorithm simpler. (Here we consider the same 
situation and the goal as above.) 

Our condition

At each boosting step, there always exists a hypothesis $h$ satisfying the
following:

(1) $h$ is either binary (in which case $k = 2$) or $k$-class with some $k$, $2 <
k \le K$, such that $k \le s/s'$, and

(2) $h$ has advantage larger than $\gamma \lceil \log k \rceil$.

This condition is simpler, and the above explained intuition becomes clearer
under this condition. Item (2) of this new condition means that a $k$-class
hypothesis $h$ is better than an “equivalent” depth $\lceil \log k \rceil$ decision tree consisting
of binary hypotheses with advantage $\gamma$. That is, if we can always find such a
hypothesis at each boosting step, then the algorithm produces a tree that is as
good as the one produced by the original boosting algorithm using only binary
hypotheses with advantage $\gamma$.
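The new condition is simple enough to state as a predicate. The sketch below assumes the logarithm is base 2 (suggested by the comparison with a depth $\lceil \log k \rceil$ tree of binary hypotheses) and reads item (1) as imposing the size constraint only on hypotheses with more than two classes; both readings are assumptions, and the names are hypothetical.

```python
def ceil_log2(k: int) -> int:
    """Smallest integer t with 2**t >= k, for k >= 2."""
    if k < 2:
        raise ValueError("k must be at least 2")
    return (k - 1).bit_length()


def satisfies_condition(k: int, advantage: float, gamma: float,
                        s: int, s_current: int, K: int) -> bool:
    """Check items (1) and (2) of the simplified condition for one hypothesis.

    k          number of classes (branches) of the hypothesis
    advantage  its advantage at the current boosting step
    gamma      advantage level of the reference binary hypotheses
    s          target tree size; s_current is the current tree size s'
    K          maximum number of branches allowed
    """
    if not (2 <= k <= K):
        return False
    size_ok = (k == 2) or (k <= s / s_current)       # item (1)
    return size_ok and advantage > gamma * ceil_log2(k)  # item (2)
```

For instance, with gamma = 0.1 a 4-class hypothesis needs advantage above 0.2, the same as two levels of binary hypotheses with advantage 0.1 each.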

In fact, based on this new interpretation, we propose to compare the quality
of weak $k$-class hypotheses for different $k$ based on the quantity computed as the
information gain over $\lceil \log k \rceil$. This simplifies the original algorithm of Mansour
and McAllester, and moreover, by this modification, we no longer need to know
a lower bound $\gamma$ for the advantage of binary weak hypotheses.
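The proposed comparison rule amounts to normalizing each candidate's information gain by $\lceil \log k \rceil$. A minimal sketch, assuming a base-2 logarithm and candidates given as hypothetical (name, k, information gain) triples:

```python
def normalized_gain(k: int, information_gain: float) -> float:
    """Information gain divided by ceil(log2 k), for a k-class hypothesis."""
    if k < 2:
        raise ValueError("a hypothesis needs at least 2 classes")
    return information_gain / (k - 1).bit_length()


def pick_hypothesis(candidates):
    """Return the (name, k, gain) triple with the largest normalized gain."""
    return max(candidates, key=lambda c: normalized_gain(c[1], c[2]))
```

Note that the rule uses no estimate of $\gamma$: candidates with different numbers of branches are ranked against each other directly.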

Technically, item (2) of our condition is stronger (i.e., worse) than the
original one; this is because $\lceil \log k \rceil \ge g_\gamma(k)$. But item (1) of our condition
is weaker (i.e., better) than the original one.

In our argument, we introduce the Weight Distribution Game for analyzing the
worst-case error bound.



2 Preliminaries 

We introduce our learning model briefly. Our model is based on the PAC learning
model proposed by Valiant [13]. Let $X$ denote an instance space. We assume the
unknown target function $f : X \to \{0, 1\}$. The learner is given a sample $S$ of
$m$ labeled examples, $S = ((x_1, f(x_1)), \ldots, (x_m, f(x_m)))$, where each $x_i$ is drawn
independently at random with respect to an unknown distribution $P$ over $X$.
The goal of the learner is, for any given constants $\epsilon$ and $\delta$ ($0 < \epsilon, \delta < 1$), to
output a hypothesis $h_f : X \to \{0, 1\}$ such that its generalization error $\epsilon(h_f) =
\Pr_P[f(x) \ne h_f(x)]$ is below $\epsilon$, with probability at least $1 - \delta$.

In order to accomplish the goal, it is sufficient to design learning algorithms
based on “Occam Razor” [1]. Namely, it is sufficient to construct a learning
algorithm that outputs a hypothesis $h_f$ satisfying the following conditions: for a
sufficiently large sample, (1) $h_f$’s training error $\hat{\epsilon}(h_f) = \Pr_D[f(x) \ne h_f(x)]$ is
small, where $D$ is the uniform distribution over $S$, and (2) $size(h_f) = o(m)$,
where $size(\cdot)$ represents the length of the bit string for $h_f$ under some fixed
encoding scheme.

In this paper we consider decision tree learning, i.e., the problem of constructing
a decision tree satisfying the above PAC learning criteria. More precisely, for
a given target $f$, we would like to obtain some decision tree $T$ representing a
hypothesis $h_T$ whose generalization error is bounded by a given $\epsilon$ (with high
probability). Note that if a hypothesis is represented as a decision tree, the second
condition of the Occam learning criterion can be interpreted as the number of
leaves of the tree being sufficiently small with respect to the size of the sample.
By the Occam's Razor approach mentioned above, we can construct a decision
tree learning algorithm that meets the criteria.

Here we recall some basic notations about decision trees. We assume a set
$H$ of hypotheses $h : X \to R_h$, where $2 \le |R_h| \le K$ for a fixed integer $K$. We
allow each $h \in H$ to have a different range. We denote the set of decision trees
determined by $H$ as $\mathcal{T}(H)$. A decision tree consists of internal nodes and leaves.
Let $T$ be any decision tree in $\mathcal{T}(H)$. Each internal node is labeled with a hypothesis
$h \in H$ and has $|R_h|$ child nodes corresponding to the values of $h$, where child
nodes are either internal nodes or leaves. Each leaf is labeled 0 or 1.

When the number of branches of each node in $T$ is some fixed $k \ge 2$, we
call $T$ a $k$-way branching decision tree. In general, including the case where
the number of branches differs from node to node, we call $T$ a multi-way
branching decision tree.

Finally, let us clarify how a decision tree is used to classify a given
instance. Suppose that an instance $x \in X$ is given to a decision tree $T$. First
the root node of $T$ is visited. Then the child node corresponding to the value
of $h(x)$, where $h$ is the hypothesis labeling the current node, is visited next,
and so on. Finally, $x$ reaches some leaf, and $T$ outputs the label of that leaf.
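
The classification procedure above can be sketched in a few lines of Python. This is only an illustration: the class names `Leaf` and `Node`, the function `classify`, and the example tree are assumptions made here, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Union


@dataclass
class Leaf:
    label: int  # 0 or 1


@dataclass
class Node:
    h: Callable[[object], Hashable]  # hypothesis labeling this internal node
    children: Dict[Hashable, "Tree"] = field(default_factory=dict)  # one child per value in R_h


Tree = Union[Leaf, Node]


def classify(tree: Tree, x) -> int:
    """Route x from the root down to a leaf and return that leaf's label."""
    node = tree
    while isinstance(node, Node):
        node = node.children[node.h(x)]
    return node.label


# Hypothetical multi-way tree: a 3-way root on x mod 3 whose middle child
# is a binary node testing the sign of x.
example = Node(
    h=lambda x: x % 3,
    children={
        0: Leaf(0),
        1: Node(h=lambda x: x > 0, children={False: Leaf(0), True: Leaf(1)}),
        2: Leaf(1),
    },
)
```

Here `classify(example, 4)` visits the root, follows the branch for value 1, then the branch for `True`, and returns the label of the leaf reached.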

Now we introduce the notion of boosting. In the classical definition, boosting
is to construct a hypothesis $h_f$ such that $\Pr_D[f(x) \ne h_f(x)] \le \epsilon$ for any given
$\epsilon$ by combining hypotheses $h_1, \dots, h_t$, where each $h_i$ satisfies
$\Pr_{D_i}[f(x) \ne h_i(x)] \le 1/2 - \gamma$ for some distribution $D_i$ over $X$ ($i = 1, \dots, t$, for some $t \ge 1$).
However, in this paper, we measure the goodness of hypotheses from an
information-theoretic point of view. For this we use the pseudo-entropy proposed
by Takimoto and Maruoka [12].

Definition 1 A function $G : [0,1]^2 \to [0,1]$ is a pseudo-entropy function if, for
any $q_0, q_1 \in [0,1]$ such that $q_0 + q_1 = 1$,

1. $\min\{q_0, q_1\} \le G(q_0, q_1)$,
2. $G(q_0, q_1) = 0 \iff q_0 = 1$ or $q_1 = 1$, and
3. $G$ is concave and symmetric about $(1/2, 1/2)$.

For example, the Shannon entropy function $q_0 \log(1/q_0) + q_1 \log(1/q_1)$ and $\sqrt{q_0 q_1}$
(proposed by Kearns and Mansour [B]) are pseudo-entropy functions. Next we
define the entropy of the function $f$ using a pseudo-entropy function $G$.

Definition 2 The G-entropy of $f$ with respect to $D$, denoted by $H_G^D(f)$, is
defined as

\[ H_G^D(f) \stackrel{\mathrm{def}}{=} G(q_0, q_1), \]

where $q_i = \Pr_D[f(x) = i]$ ($i = 0, 1$).

We can interpret the G-entropy as the “impurity” of the values of $f$ under the
distribution $D$. For example, if $f$ takes only one value, the G-entropy attains its
minimum. If the value of $f$ is random (i.e., $q_0 = q_1 = 1/2$), the G-entropy attains its maximum.
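
As a small illustration of Definitions 1 and 2, the following Python sketch computes the G-entropy of a list of labels under the uniform distribution. The function names are assumptions made for this example, and the logarithm is taken to base 2 so that the Shannon entropy has range $[0,1]$.

```python
import math
from collections import Counter


def shannon(q0: float, q1: float) -> float:
    """Binary Shannon entropy (base 2), with 0 * log(1/0) taken as 0."""
    return sum(q * math.log2(1.0 / q) for q in (q0, q1) if q > 0)


def kearns_mansour(q0: float, q1: float) -> float:
    """The square-root pseudo-entropy sqrt(q0 * q1)."""
    return math.sqrt(q0 * q1)


def g_entropy(G, labels) -> float:
    """H_G^D(f) for D uniform over a nonempty list of 0/1 labels f(x)."""
    counts = Counter(labels)
    m = len(labels)
    return G(counts[0] / m, counts[1] / m)
```

A pure list such as `[0, 0, 0]` has G-entropy 0 under either function, while a balanced list such as `[0, 1]` gives the largest value each function can take.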

We also define the conditional G-entropy given a hypothesis $h : X \to R_h$,
where $R_h$ is a finite set but possibly different from $\{0, 1\}$.

Definition 3 The conditional G-entropy of $f$ given $h$ with respect to $D$, denoted
by $H_G^D(f \mid h)$, is defined as

\[ H_G^D(f \mid h) \stackrel{\mathrm{def}}{=} \sum_{j \in R_h} \Pr_D[h(x) = j]\, G(q_{0|j}, q_{1|j}), \]

where $q_{i|j} = \Pr_D[f(x) = i \mid h(x) = j]$ ($i = 0, 1$, $j \in R_h$).

Since the range of $h$ may be different from that of $f$, we have to give a
way to interpret the values of $h$. More precisely, we define a mapping $M : R_h \to \{0,1\}$
by $M(j) \stackrel{\mathrm{def}}{=} \arg\max_{i \in \{0,1\}} q_{i|j}$. We show the following relationship between
the classification error and the G-entropy.

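Proposition 1 $\Pr_D[f(x) \ne M(h(x))] \le H_G^D(f \mid h)$.

Proof. We denote $w_j = \Pr_D[h(x) = j]$ for $j \in R_h$. Then, we have

\[
\begin{aligned}
\Pr_D[f(x) \ne M(h(x))] &= \sum_{j \in R_h} w_j \Pr_D[f(x) \ne M(h(x)) \mid h(x) = j] \\
&= \sum_{j \in R_h} w_j \min\{q_{0|j}, q_{1|j}\} \\
&\le \sum_{j \in R_h} w_j G(q_{0|j}, q_{1|j}) = H_G^D(f \mid h).
\end{aligned}
\]

□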

We note that if $H_G^D(f \mid h) = 0$, then the error probability $\Pr_D[f(x) \ne M(h(x))]$
also becomes 0.
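
Proposition 1 can be checked numerically on a finite sample. The sketch below is an illustration only: it takes $D$ to be uniform over the sample, uses the square-root pseudo-entropy, and the helper name and the example target are assumptions made here.

```python
import math
from collections import defaultdict


def G(q0, q1):
    return math.sqrt(q0 * q1)  # any pseudo-entropy function could be used here


def error_and_conditional_entropy(sample, h):
    """sample: list of (x, f(x)) pairs, with D uniform over the sample.

    Returns (Pr_D[f(x) != M(h(x))], H_G^D(f | h))."""
    buckets = defaultdict(lambda: [0, 0])  # j -> [#examples with f=0, #examples with f=1]
    for x, y in sample:
        buckets[h(x)][y] += 1
    m = len(sample)
    error = entropy = 0.0
    for n0, n1 in buckets.values():
        w_j = (n0 + n1) / m
        q0, q1 = n0 / (n0 + n1), n1 / (n0 + n1)
        error += w_j * min(q0, q1)   # M(j) predicts the majority label at j
        entropy += w_j * G(q0, q1)
    return error, entropy


# Hypothetical target on {0, ..., 19} and a 3-valued hypothesis.
sample = [(x, int(x % 5 < 2)) for x in range(20)]
err, ent = error_and_conditional_entropy(sample, lambda x: x % 3)
```

For any hypothesis passed in, the first returned value should not exceed the second, which is exactly the statement of the proposition.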

The following relationship between “error-based” and “information-based”
hypotheses was first proved by Kearns and Mansour [^.

Lemma 2 (Kearns and Mansour [6]) Suppose $G(q_0, q_1) = \sqrt{q_0 q_1}$. For any
distribution $D$ over $X$, if there exists a hypothesis $h : X \to \{0, 1\}$ such that
$\Pr_D[f(x) \ne h(x)] \le 1/2 - \delta$, then there exists a hypothesis $h' : X \to \{0, 1\}$ such
that $H_G^D(f) - H_G^D(f \mid h') \ge \frac{\delta^2}{16} H_G^D(f)$.

Motivated by the above lemma, we state our assumption. We assume a set of
“information-based weak hypotheses” of the target function $f$.

Definition 4 Let $f$ be any Boolean function over $X$. Let $G : [0,1]^2 \to [0,1]$ be
any pseudo-entropy function. Let $H$ be any set of hypotheses. $H$ and $G$ satisfy
the $\gamma$-weak hypothesis assumption for $f$ if for any distribution $D$ over $X$, there
exists a hypothesis $h \in H$ satisfying

\[ H_G^D(f) - H_G^D(f \mid h) \ge \gamma H_G^D(f), \]

where $0 < \gamma \le 1$. We call this constant $\gamma$ the advantage and refer to the reduction
$H_G^D(f) - H_G^D(f \mid h)$ as the gain.

3 Learning Binary Decision Trees 

Before studying the multi-way branching decision tree learning algorithm, we
review a binary decision tree learning algorithm proposed by Kearns and Mansour
0 .

For the binary target function $f$, this algorithm constructs binary decision
trees. We assume that the algorithm is given some pseudo-entropy function $G$
and a set $H_2$ of binary hypotheses $h : X \to R_h$, $|R_h| = 2$, in advance. We call the
algorithm TOPDOWN$_{G,H_2}$. The description of TOPDOWN$_{G,H_2}$ is given in
Figure 1.

In what follows, we explain the idea and the outline of this algorithm.
TOPDOWN$_{G,H_2}$, given a sample $S$ and an integer $s \ge 2$ as input, outputs
a decision tree with $s$ leaves, where internal nodes are labeled with functions
in $H_2$. The algorithm's goal is to obtain a decision tree that has small training
error on $S$ under the uniform distribution $D$ over $S$. For this, the algorithm tries
to reduce the conditional G-entropy of $f$, given a constructed decision tree.

Let $T$ be a decision tree that the algorithm has constructed so far. (Initially,
$T$ is a tree with a single leaf.) Let $L(T)$ denote the set of leaves of $T$. (We denote
$|T| = |L(T)|$.) Then $T$ can be regarded as a mapping from $X$ to $L(T)$. For each
$\ell \in L(T)$, we define $w_\ell \stackrel{\mathrm{def}}{=} \Pr_D[T(x) = \ell]$ and
$q_{i|\ell} \stackrel{\mathrm{def}}{=} \Pr_D[f(x) = i \mid T(x) = \ell]$ (where $i = 0, 1$).

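Then the training error of $T$, which we denote as $\hat{\epsilon}(T)$, is computed as follows:
$\hat{\epsilon}(T) = \Pr_D[f(x) \ne M(T(x))]$, where $M(\ell) = \arg\max_{i \in \{0,1\}} q_{i|\ell}$ for any $\ell \in L(T)$.
We denote the conditional G-entropy of $f$ given $T$ as

\[ H_G^D(f \mid T) = \sum_{\ell \in L(T)} w_\ell\, G(q_{0|\ell}, q_{1|\ell}). \]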

From Proposition 1 we have $\hat{\epsilon}(T) \le H_G^D(f \mid T)$. In order to reduce this entropy,
the algorithm makes a local change to the tree $T$. At each local change, the
algorithm chooses a leaf $\ell \in L(T)$ and $h \in H_2$ and replaces $\ell$ with a new internal
node labeled $h$ (and its two new child leaves). The tree obtained in this way is
denoted $T_{\ell,h}$. Compared with $T$, $T_{\ell,h}$ has one more leaf.

We explain the way to choose $\ell$ and $h$ at each local change. The algorithm
chooses the $\ell$ that maximizes $w_\ell G(q_{0|\ell}, q_{1|\ell})$; it then calculates the sample $S_\ell$, the subset
of $S$ reaching $\ell$. Finally the algorithm chooses the $h \in H_2$ that maximizes the gain
$H_G^{D_\ell}(f) - H_G^{D_\ell}(f \mid h)$, where $D_\ell$ is the uniform distribution over $S_\ell$.

Note that $H_G^D(f \mid T) - H_G^D(f \mid T_{\ell,h}) = w_\ell \bigl(H_G^{D_\ell}(f) - H_G^{D_\ell}(f \mid h)\bigr)$. Thus, if the gain
is positive, then we reduce the conditional G-entropy.
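
The greedy procedure described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the tree is represented only by the lists of examples reaching its leaves, the pseudo-entropy is fixed to $\sqrt{q_0 q_1}$, and the loop stops early when the best hypothesis fails to split the chosen leaf (an assumption added here to avoid empty leaves). All function names are hypothetical.

```python
import math
from collections import defaultdict


def G(q0, q1):
    return math.sqrt(q0 * q1)


def entropy(examples):
    """G(q0, q1) under the uniform distribution over a nonempty list of (x, y)."""
    q1 = sum(y for _, y in examples) / len(examples)
    return G(1.0 - q1, q1)


def split(examples, h):
    """Partition the examples by the value of h(x)."""
    parts = defaultdict(list)
    for x, y in examples:
        parts[h(x)].append((x, y))
    return list(parts.values())


def gain(examples, h):
    """H_G(f) - H_G(f | h) under the uniform distribution over the examples."""
    n = len(examples)
    cond = sum(len(p) / n * entropy(p) for p in split(examples, h))
    return entropy(examples) - cond


def topdown(sample, hypotheses, s):
    """Grow a binary tree with at most s leaves; returns the leaves as example lists."""
    m = len(sample)
    leaves = [list(sample)]
    while len(leaves) < s:
        # Pick the leaf with the largest weight w_l * G(q_{0|l}, q_{1|l}).
        idx = max(range(len(leaves)),
                  key=lambda i: len(leaves[i]) / m * entropy(leaves[i]))
        leaf = leaves[idx]
        h = max(hypotheses, key=lambda hyp: gain(leaf, hyp))
        parts = split(leaf, h)
        if len(parts) < 2:      # the best hypothesis does not separate this leaf
            break
        leaves[idx:idx + 1] = parts
    return leaves


def training_error(leaves, m):
    """Error of the tree whose leaves predict their majority label."""
    wrong = 0
    for leaf in leaves:
        ones = sum(y for _, y in leaf)
        wrong += min(ones, len(leaf) - ones)
    return wrong / m
```

For instance, with threshold hypotheses `x >= t` on the integers 0 to 15 and a target that is 1 exactly on the interval from 4 to 11, three leaves are enough for this sketch to separate the sample perfectly.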

For the efficiency of this boosting algorithm, Kearns and Mansour showed the
following result.

Theorem 3 (Kearns and Mansour [5]) Assume that $H_2$ and $G$ satisfy the
$\gamma$-weak hypothesis assumption. Then, TOPDOWN$_{G,H_2}(S, s)$ outputs $T$ with
$\hat{\epsilon}(T) \le H_G^D(f \mid T) \le s^{-\gamma}$.

4 Learning Multi-way Branching Decision Trees 

We propose a simpler version of Mansour and McAllester's algorithm, which
constructs multi-way branching decision trees. We weaken and simplify some
technical conditions of their algorithm.

TOPDOWN$_{G,H_2}(S, s)$
begin
    $T \leftarrow$ the single-leaf tree;
    while $|T| < s$ do
        $\ell \leftarrow \arg\max_{\ell' \in L(T)} w_{\ell'} G(q_{0|\ell'}, q_{1|\ell'})$;
        $S_\ell \leftarrow \{(x, f(x)) \in S \mid T(x) = \ell\}$;
        $D_\ell \leftarrow$ the uniform distribution over $S_\ell$;
        $h \leftarrow \arg\max_{h' \in H_2} \bigl(H_G^{D_\ell}(f) - H_G^{D_\ell}(f \mid h')\bigr)$;
        $T \leftarrow T_{\ell,h}$;
    end-while
    Output $T$;
end.

Fig. 1. Algorithm TOPDOWN$_{G,H_2}$



Let $H$ be a set of hypotheses where each $h \in H$ is a function from $X$ to $R_h$
($2 \le |R_h| \le K$). We assume that $H$ contains a set of binary hypotheses $H_2$.
The algorithm is given $H$ and some pseudo-entropy function $G : [0,1]^2 \to [0,1]$
beforehand. We call the algorithm TOPDOWN-M$_{G,H}$. TOPDOWN-M$_{G,H}$,
given a sample $S$ and an integer $s \ge 2$ as input, outputs a multi-way branching
decision tree $T$ with $|T| = s$.

The algorithm is a generalization of TOPDOWN$_{G,H_2}$. One of the main
modifications is the criterion for choosing hypotheses. TOPDOWN-M$_{G,H}$ chooses
the hypothesis $h : X \to R_h$ that maximizes the gain over $\lceil \log |R_h| \rceil$, rather than
merely comparing the gain. Because the given size of the tree is limited, in order
to reduce the conditional G-entropy as much as possible, it is natural to choose a
hypothesis with a smaller range among hypotheses that have the same amount of
gain. On the other hand, the criterion that Mansour and McAllester's algorithm
uses is the gain over $g_\gamma(|R_h|)$. Note that it is necessary to know $\gamma$ to compute
$g_\gamma$.

The other modification is a constraint on hypotheses with respect to the size
of the tree. We say that a hypothesis $h$ is acceptable for tree $T$ and target size $s$
if either $|R_h| = 2$ or $2 < |R_h| \le s/|T|$.
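
The two modifications, the acceptability constraint and the selection criterion "gain over $\lceil \log |R_h| \rceil$", can be illustrated with a short Python sketch. The function names and the candidate triples are assumptions made here, and the logarithm is assumed to be base 2 (so that a binary hypothesis has denominator 1).

```python
def ceil_log2(k: int) -> int:
    """ceil(log2 k) for an integer k >= 2, computed exactly with integers."""
    return (k - 1).bit_length()


def is_acceptable(range_size: int, tree_leaves: int, s: int) -> bool:
    """|R_h| = 2, or 2 < |R_h| <= s / |T|."""
    return range_size == 2 or 2 < range_size <= s / tree_leaves


def score(gain: float, range_size: int) -> float:
    """Gain divided by ceil(log |R_h|), the quantity to be maximized."""
    return gain / ceil_log2(range_size)


def choose(candidates, tree_leaves: int, s: int):
    """candidates: (name, gain, |R_h|) triples; returns the best acceptable one."""
    ok = [c for c in candidates if is_acceptable(c[2], tree_leaves, s)]
    return max(ok, key=lambda c: score(c[1], c[2]))


# Hypothetical candidates for one leaf.
candidates = [("binary", 0.10, 2), ("4-way", 0.25, 4), ("8-way", 0.45, 8)]
```

With a single-leaf tree and target size 6, the 8-way candidate is not acceptable and the 4-way candidate has the best score; once the tree has 4 leaves, only the binary candidate remains acceptable.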

Note that if $|T| > s/2$, then only binary hypotheses are acceptable. However,
$H$ contains $H_2$; thus the algorithm can always select a hypothesis that is
acceptable for any $T$ and $s$. We show the details of the algorithm in Figure 2.
Note that if $H_2 \subset H$ satisfies the $\gamma$-weak hypothesis assumption and a hypothesis
$h : X \to R_h$ is selected for $\ell$, then

\[ H_G^{D_\ell}(f) - H_G^{D_\ell}(f \mid h) \ge \gamma \lceil \log |R_h| \rceil H_G^{D_\ell}(f). \]

4.1 Our Analysis 

We give an analysis for the algorithm TOPDOWN-M$_{G,H}$.

TOPDOWN-M$_{G,H}(S, s)$
begin
    $T \leftarrow$ the single-leaf tree;
    while $|T| < s$ do
        $\ell \leftarrow \arg\max_{\ell' \in L(T)} w_{\ell'} G(q_{0|\ell'}, q_{1|\ell'})$;
        $S_\ell \leftarrow \{(x, f(x)) \in S \mid T(x) = \ell\}$;
        $D_\ell \leftarrow$ the uniform distribution over $S_\ell$;
        $h \leftarrow \arg\max_{h' \in H,\ h' \text{ acceptable for } T \text{ and } s} \dfrac{H_G^{D_\ell}(f) - H_G^{D_\ell}(f \mid h')}{\lceil \log |R_{h'}| \rceil}$;
        $T \leftarrow T_{\ell,h}$;
    end-while
    Output $T$;
end.

Fig. 2. Algorithm TOPDOWN-M$_{G,H}$



First we define some notations. For any leaf $\ell$, we define its weight $W_\ell$ as

\[ W_\ell \stackrel{\mathrm{def}}{=} w_\ell\, G(q_{0|\ell}, q_{1|\ell}). \]

The weight of a tree is just the total weight of all its leaves. Then by definition,
we immediately have the following relations.

Fact 1 1. $\hat{\epsilon}(T) \le \sum_{\ell \in L(T)} W_\ell$.

2. If $H_2$ and $G$ satisfy the $\gamma$-weak hypothesis assumption, then for any leaf $\ell$ and
weights of $\ell$'s child leaves $W_1, \dots, W_k$ ($2 \le k \le K$), we have
$W_1 + \cdots + W_k \le (1 - \gamma \lceil \log k \rceil) W_\ell$.

From Fact 1 (1), in order to bound the training error of the tree, it suffices for
us to consider the weight of the tree. On the other hand, it follows from Fact 1
(2) that at each boosting step, the weight of the tree decreases by at least
$\gamma \lceil \log k \rceil W_\ell$, provided that a leaf $\ell$ is “expanded” by this boosting step and a
$k$-class hypothesis is chosen for $\ell$. Thus, we need to analyze how the weight of
the tree decreases under the situation that, at each boosting step, (i) the
leaf $\ell$ of the largest weight is selected and expanded, and (ii) the weight
decreases exactly by $\gamma \lceil \log k \rceil W_\ell$ when a $k$-class hypothesis is chosen for $\ell$. That
is, we consider the worst case under this situation and discuss how the tree's
weight decreases in the worst case. Notice that the only freedom left here
is (i) the number of child nodes $k$ under the constraint $k = 2$ or $2 < k \le s/|T|$,
and (ii) the distribution of the weights $W_1, \dots, W_k$ of the child leaves of $\ell$ under
the constraint $W_1 + \cdots + W_k = (1 - \gamma \lceil \log k \rceil) W_\ell$. Therefore, for analyzing the
worst case, we would like to know the way to determine the number of child
nodes and the way to distribute the weight $W_\ell$ to its child nodes in order to
minimize the decrease of the total tree weight. This fact motivates us to define
the following combinatorial game.

Weight Distribution Game

1. Initially, the player is given a single-leaf tree $T$ where the weight of the leaf is
$W$ ($0 \le W \le 1$), and an integer $s$ ($s \ge 2$).

2. While $|T| < s$, the player repeats the following procedure:

a) Choose the leaf $\ell$ that has the maximum weight.

b) Choose any integer $k$ ($\ge 2$) satisfying either $k = 2$ or $2 < k \le s/|T|$.

c) Expand the leaf $\ell$: replace the leaf with an internal node with $k$ child
leaves and assign weights $W_1, \dots, W_k$ to the child leaves so that the following
equation holds: $W_1 + \cdots + W_k = (1 - \gamma \lceil \log k \rceil) W_\ell$.

Player's goal: Maximize the weight of the final tree $T$ with $s$ leaves.

The best strategy for the player in this game is given by the following result. 
(The proof is given in the next subsection.) 

Theorem 4 In the Weight Distribution Game, the tree weight is maximized if
the player always chooses the number of child leaves $k$ equal to 2 and distributes
the weight of the expanded leaf equally among all its child nodes.

Now the worst-case situation for our boosting algorithm is clear from this
result and our discussion above. That is, the G-entropy of $f$, given the tree,
decreases slowest when every chosen hypothesis is binary and divides the leaf's
entropy equally among all its child nodes. Thus, by analyzing this case, we are
able to derive an upper bound on the training error.
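
To make the game concrete, the following Python sketch simulates it for a given strategy. It is an illustration under assumptions made here: the callbacks `choose_k` and `split` and the three example strategies are hypothetical, and the logarithm is taken to base 2.

```python
import heapq


def ceil_log2(k: int) -> int:
    return (k - 1).bit_length()


def play(W: float, s: int, gamma: float, choose_k, split) -> float:
    """Play the Weight Distribution Game and return the final tree weight.

    choose_k(num_leaves, s) -> k and split(k) -> k fractions summing to 1
    describe the player's strategy."""
    heap = [-W]                      # max-heap of leaf weights via negation
    while len(heap) < s:
        k = choose_k(len(heap), s)
        assert k == 2 or 2 < k <= s / len(heap)   # step (b)
        w = -heapq.heappop(heap)     # step (a): the heaviest leaf
        total = (1 - gamma * ceil_log2(k)) * w    # step (c)
        for frac in split(k):
            heapq.heappush(heap, -total * frac)
    return -sum(heap)


def always_binary(num_leaves, s):
    return 2


def as_wide_as_allowed(num_leaves, s):
    k = s // num_leaves
    return k if k > 2 else 2


def equal(k):
    return [1.0 / k] * k


def uneven_binary(k):
    return [0.9, 0.1]
```

With `W = 1`, `s = 16`, and `gamma = 0.1`, the binary equal-split strategy ends with weight $(1 - \gamma)^4$, which is at least as large as what the other two strategies achieve in this sketch and stays below $s^{-\gamma}$.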

Theorem 5 Assume that $H_2 \subset H$ satisfies the $\gamma$-weak hypothesis assumption
for $f$. Then, TOPDOWN-M$_{G,H}(S, s)$ outputs $T$ with $\hat{\epsilon}(T) \le H_G^D(f \mid T) \le s^{-\gamma}$.

Proof. In the Weight Distribution Game, suppose that the player chooses the
number of child nodes $k = 2$ and distributes the weight of each leaf equally
among its new child leaves. Then the player always chooses a leaf in the oldest
generation that is not expanded yet. Note that after $s - 1$ expansions, the number
of the leaves of the tree becomes $s$. Let $t = s - 1$. Suppose that after the $t$-th
expansion, all leaves in the $i$-th generation are expanded (we assume the initial
leaf is in the first generation) and there are $t'$ leaves expanded in the $(i+1)$-th
generation ($0 \le t' \le 2^i$). Then the number of all expansions $t$ is given by

\[ t = \sum_{j=1}^{i} 2^{j-1} + t' = 2^i - 1 + t'. \]

One can observe that just after all leaves in each generation are expanded,
the weight of the tree is multiplied by $(1 - \gamma)$. Thus after all expansions in the
$i$-th generation, the weight of the tree is $W(1 - \gamma)^i$. Because the weight of each
leaf in the $(i+1)$-th generation is $W(1 - \gamma)^i/2^i$, the weight of the tree after the $t$-th
expansion is $W(1 - \gamma)^i(1 - t'\gamma/2^i)$. Note that $1 - x \le e^{-x}$ for any $0 \le x \le 1$
and $W \le 1$. Then we have

\[ W(1 - \gamma)^i \left(1 - \frac{t'\gamma}{2^i}\right) \le \exp[-\gamma(i + t'/2^i)]. \]

From the fact that $i + t'/2^i \ge \ln(2^i + t')$, we have

\[ \exp[-\gamma(i + t'/2^i)] \le \exp[-\gamma \ln(2^i + t')] = s^{-\gamma}. \]

□



4.2 Weight Distribution Game 

We now prove Theorem 4. First we prepare some notations. Let $\mathcal{D}_k$ denote
the set of all possible distributions over $\{1, \dots, k\}$ ($2 \le k \le K$). In particular, let
$d_k^*$ be the uniform one, i.e., $d_k^* = (1/k, \dots, 1/k)$. We also define $\mathcal{D} = \bigcup_{k \ge 2} \mathcal{D}_k$.
Note that for any distribution $d_k \in \mathcal{D}$, the subscript $k$ is the size of the domain
of $d_k$.

For any sequence of distributions $d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)} \in \mathcal{D}^*$ ($t \ge 1$), we denote by
$\mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)})$ the weight of the tree obtained in the following way:

1. the initial weight is $W$; and,

2. at the $i$-th step, the number of child nodes is $k_i$ and the way to distribute
weights is specified by $d_{k_i}^{(i)}$ ($1 \le i \le t$).

Then we consider a sequence of distributions corresponding to a sequence of
hypotheses that are acceptable for $s$. We say that such a sequence is acceptable.
More precisely, a sequence of distributions $d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)}$ with length $t$ is acceptable
for $s$ if $s = (k_1 - 1) + \cdots + (k_t - 1) + 1$ and for any integer $i$ ($1 \le i \le t$), $k_i = 2$,
or $2 < k_i \le s/\{(k_1 - 1) + \cdots + (k_{i-1} - 1) + 1\}$. By using this notation, Theorem
4 can be re-written as follows.

Theorem 6 For any weight $W$, any integer $s \ge 2$ and any sequence of distributions
$d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)} \in \mathcal{D}^*$ that is acceptable for $s$,

\[ \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)}) \le \mathrm{sum}(W, \underbrace{d_2^*, \dots, d_2^*}_{s-1}). \]

To prove the theorem, we show that the following relation holds for any
integer $u$ ($1 \le u \le s - 1$) and any sequence of distributions $d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)} \in \mathcal{D}^*$:

\[ \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, \underbrace{d_2^*, \dots, d_2^*}_{t}) \le \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_{u-1}}^{(u-1)}, \underbrace{d_2^*, \dots, d_2^*}_{t + k_u - 1}), \]

where $t$ is the integer such that the sequence $d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^* \in \mathcal{D}^*$ is
acceptable for $s$. We prove this relation in Lemma 8. Before doing this, we begin
with a rather simple lemma.

Lemma 7 For any weight $W \in [0, 1]$, any integer $s \ge 2$, any distribution $d_k \in \mathcal{D}$,
and the sequence of distributions $d_k, d_2^*, \dots, d_2^* \in \mathcal{D}^*$ (with $t$ copies of $d_2^*$) that is
acceptable for $s$,

\[ \mathrm{sum}(W, d_k, \underbrace{d_2^*, \dots, d_2^*}_{t}) \le \mathrm{sum}(W, \underbrace{d_2^*, \dots, d_2^*}_{s-1}). \]

Proof. Suppose that the player does not choose a leaf whose weight is the
maximum. Instead, suppose the player always selects a leaf in the oldest generation
that is not expanded yet, whose weight is the maximum among weights of all
leaves in the same generation. We denote the sum under the situation above as
$\mathrm{sum}'$. In this situation, the player may not choose the leaf with the maximum
weight at each step. That makes the decrease of the sum no larger than that in the
original setting. Thus, we have

\[ \mathrm{sum}(W, d_k, \underbrace{d_2^*, \dots, d_2^*}_{t}) \le \mathrm{sum}'(W, d_k, \underbrace{d_2^*, \dots, d_2^*}_{t}). \]

Now we prove that the following inequality holds.

\[ \mathrm{sum}'(W, d_k, \underbrace{d_2^*, \dots, d_2^*}_{t}) \le \mathrm{sum}(W, d_k^*, \underbrace{d_2^*, \dots, d_2^*}_{t}). \qquad (1) \]

Let $T'$ and $T^*$ be the trees corresponding to the left and right sides of the above
inequality, respectively.

First, consider the case when $T'$ and $T^*$ are completely balanced, i.e., the
case just after all leaves of the trees in the $(i+1)$-th generation are expanded for
some $i$. Then the weights of both $T'$ and $T^*$ are $W(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i$; thus
inequality (1) holds.

Second, we consider the other case, that is, $T'$ and $T^*$ are not completely
balanced. Suppose that $T'$ and $T^*$ are trees such that all leaves in the $(i+1)$-th
generation and $t'$ leaves in the $(i+2)$-th generation are expanded ($1 \le t' \le k 2^i$).
We denote the number of nodes in the $(i+2)$-th generation of both trees
as $J = k 2^i$. We also denote the weights of the nodes in the $(i+2)$-th generation of
$T'$ as $W_1, \dots, W_J$. (W.l.o.g., we assume that $W_1 \ge \cdots \ge W_J$.) Then it holds
that $\sum_{j=1}^{J} W_j = W(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i$. On the other hand, the weights of the
nodes in the $(i+2)$-th generation of $T^*$ are all the same. We denote this weight as
$\widehat{W} = W(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i / J$. Now we claim that for any $t'$, $1 \le t' \le J$,

\[ \sum_{j=1}^{t'} W_j \ge t' \widehat{W}. \]

Suppose not, namely, $\sum_{j=1}^{t'} W_j < t' \widehat{W}$ for some $t'$. Then we have $W_{t'} < \widehat{W}$.
Because the sequence $\{W_j\}$ is monotone decreasing, it holds that $\sum_{j=1}^{J} W_j < J \widehat{W}$.
But this contradicts that $\sum_{j=1}^{J} W_j = J \widehat{W} = W(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i$. This
completes the proof of the claim. Then, we have

\[
\begin{aligned}
\mathrm{sum}'(W, d_k, \underbrace{d_2^*, \dots, d_2^*}_{t}) &= W(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i - \gamma \sum_{j=1}^{t'} W_j \\
&\le W(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i - \gamma t' \widehat{W} \\
&= \mathrm{sum}(W, d_k^*, \underbrace{d_2^*, \dots, d_2^*}_{t}).
\end{aligned}
\]

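Now we have proved the inequality (1). Next we prove the following inequality.

\[ \mathrm{sum}(W, d_k^*, \underbrace{d_2^*, \dots, d_2^*}_{t}) \le \mathrm{sum}(W, \underbrace{d_2^*, \dots, d_2^*}_{s-1}). \]

Suppose $s$ is given as follows: $s = 2^i + s'$ ($2^i \le s < 2^{i+1}$, $s' \ge 0$). Then, we have

\[
\begin{aligned}
\mathrm{sum}(W, \underbrace{d_2^*, \dots, d_2^*}_{s-1}) &= (1 - \gamma)^i W - \gamma s' \frac{(1 - \gamma)^i W}{2^i} \\
&= (1 - \gamma)^i \left(1 - \frac{s' \gamma}{2^i}\right) W \\
&\stackrel{\mathrm{def}}{=} \phi_\gamma(s) W.
\end{aligned}
\]

On the other hand, suppose $s$ is given as follows: $s = k 2^i + s''$ ($k 2^i \le s < k 2^{i+1}$, $s'' \ge 0$). Then, we have

\[
\begin{aligned}
\mathrm{sum}(W, d_k^*, \underbrace{d_2^*, \dots, d_2^*}_{t}) &= (1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i W - \gamma s'' \frac{(1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i W}{k 2^i} \\
&= (1 - \gamma \lceil \log k \rceil)(1 - \gamma)^i \left(1 - \frac{s'' \gamma}{k 2^i}\right) W \\
&= (1 - \gamma \lceil \log k \rceil)\, \phi_\gamma(s/k)\, W.
\end{aligned}
\]

We note that $\phi_\gamma(2s) = (1 - \gamma)\phi_\gamma(s)$ and $\phi_\gamma$ is a monotone non-increasing function.
That implies

\[ \frac{\phi_\gamma(s)}{(1 - \gamma)^{\lfloor \log k \rfloor}} \le \phi_\gamma\!\left(\frac{s}{k}\right) \le \frac{\phi_\gamma(s)}{(1 - \gamma)^{\lceil \log k \rceil}}. \]

Since $1 - \gamma \lceil \log k \rceil \le (1 - \gamma)^{\lceil \log k \rceil}$, the following inequality now holds:

\[
\begin{aligned}
\mathrm{sum}(W, d_k^*, \underbrace{d_2^*, \dots, d_2^*}_{t}) &\le (1 - \gamma \lceil \log k \rceil) \frac{\phi_\gamma(s)}{(1 - \gamma)^{\lceil \log k \rceil}} W \\
&\le \phi_\gamma(s) W = \mathrm{sum}(W, \underbrace{d_2^*, \dots, d_2^*}_{s-1}).
\end{aligned}
\]

This completes the proof. □

Next we prove our main lemma.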

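Lemma 8 For any weight $W \in [0, 1]$, any integer $s \ge 2$, any sequence of distributions
$d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^* \in \mathcal{D}^*$ that is acceptable for $s$, having length
$u + t$, and the sequence of distributions $d_{k_1}^{(1)}, \dots, d_{k_{u-1}}^{(u-1)}, d_2^*, \dots, d_2^* \in \mathcal{D}^*$
having length $u + t + k_u - 2$,

\[ \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, \underbrace{d_2^*, \dots, d_2^*}_{t}) \le \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_{u-1}}^{(u-1)}, \underbrace{d_2^*, \dots, d_2^*}_{t + k_u - 1}). \]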



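Proof. We prove this inequality starting from the right side. Note that if the
sequence $d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^*$ of length $u + t$ is acceptable for $s$, then the
sequence $d_{k_1}^{(1)}, \dots, d_{k_{u-1}}^{(u-1)}, d_2^*, \dots, d_2^*$ with length $u + t + k_u - 2$ is also acceptable
for $s$.

Let $T$ be the tree corresponding to the sum
$\mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_{u-1}}^{(u-1)}, \underbrace{d_2^*, \dots, d_2^*}_{t + k_u - 1})$.
Let $L_{u-1}$ be the set of leaves of $T$ after the $(u-1)$-th expansion. Then, we
have

\[ \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_{u-1}}^{(u-1)}, \underbrace{d_2^*, \dots, d_2^*}_{t + k_u - 1}) = \sum_{\ell \in L_{u-1}} \mathrm{sum}(W_\ell, \underbrace{d_2^*, \dots, d_2^*}_{t_\ell}), \]

where $t_\ell$ is the number of expansions for leaf $\ell$ and its descendant leaves. Note
that $t + k_u - 1 = \sum_{\ell \in L_{u-1}} t_\ell$. Let $\ell^*$ be the leaf in $L_{u-1}$ that has the maximum weight.
(So $\ell^*$ is the $u$-th leaf to be expanded.) Let $T_{\ell^*}$ be the subtree of $T$ rooted at
$\ell^*$. Let $\widetilde{T}_{\ell^*}$ be a tree that has the following properties: (i) $|T_{\ell^*}| = |\widetilde{T}_{\ell^*}|$, and (ii)
$\widetilde{T}_{\ell^*}$ is generated according to $d_{k_u}^{(u)}, \underbrace{d_2^*, \dots, d_2^*}_{t_{\ell^*} - k_u + 1}$, whereas $T_{\ell^*}$ is generated according
to $\underbrace{d_2^*, \dots, d_2^*}_{t_{\ell^*}}$. Now we consider replacing $T_{\ell^*}$ with $\widetilde{T}_{\ell^*}$. To do this, we need to
guarantee that $k_u \le |T_{\ell^*}|$. This is clear when $k_u = 2$. Then we consider the
other case. Because the sequence $d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^*$ is acceptable for $s$, by
definition, we have $k_u \le s/|L_{u-1}|$. On the other hand, $T_{\ell^*}$ is the biggest subtree
among all subtrees rooted at leaves in $L_{u-1}$. This implies $s/|L_{u-1}| \le |T_{\ell^*}|$. Now
we guarantee that $k_u \le |T_{\ell^*}|$. From Lemma 7 we have

\[ \mathrm{sum}(W_{\ell^*}, \underbrace{d_2^*, \dots, d_2^*}_{t_{\ell^*}}) \ge \mathrm{sum}(W_{\ell^*}, d_{k_u}^{(u)}, \underbrace{d_2^*, \dots, d_2^*}_{t_{\ell^*} - k_u + 1}). \]

Thus we conclude that by replacing the subtree $T_{\ell^*}$ with $\widetilde{T}_{\ell^*}$, the weight of $T$ becomes
smaller. We denote the replaced tree as $\widetilde{T}$, and denote its weight as $\widetilde{\mathrm{sum}}$. Then,
we have

\[ \sum_{\ell \in L_{u-1}} \mathrm{sum}(W_\ell, d_2^*, \dots, d_2^*) \ge \widetilde{\mathrm{sum}}(W, d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^*). \]

The tree $\widetilde{T}$ may not be produced according to the rule that the leaf that has the
maximum weight is to be expanded first. Thus, the sum of weights of $\widetilde{T}$ may be
larger than that of the tree produced according to the rule. Now we have

\[ \widetilde{\mathrm{sum}}(W, d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^*) \ge \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_u}^{(u)}, d_2^*, \dots, d_2^*). \]

This completes the proof. □

Now the proof of our theorem is easy.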



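Proof for Theorem 6. From Lemma 8, for any sequence of distributions
$d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)} \in \mathcal{D}^*$ that is acceptable for $s$, we have

\[
\begin{aligned}
\mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_t}^{(t)}) &\le \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_{t-1}}^{(t-1)}, \underbrace{d_2^*, \dots, d_2^*}_{k_t - 1}) \\
&\le \mathrm{sum}(W, d_{k_1}^{(1)}, \dots, d_{k_{t-2}}^{(t-2)}, \underbrace{d_2^*, \dots, d_2^*}_{k_{t-1} + k_t - 2}) \\
&\le \cdots \\
&\le \mathrm{sum}(W, \underbrace{d_2^*, \dots, d_2^*}_{s-1}).
\end{aligned}
\]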



□

Abstract. We first present a brief survey of hardness results for training
feedforward neural networks. These results are then completed by the
proof that the simplest architecture containing only a single neuron that
applies the standard (logistic) activation function to the weighted sum of
n inputs is hard to train. In particular, the problem of finding the weights
of such a unit that minimize the relative quadratic training error within
1 or its average (over a training set) within 13/(31n) of its infimum
proves to be NP-hard. Hence, the well-known back-propagation learning
algorithm appears not to be efficient even for one neuron, which has
negative consequences for constructive learning.



1 The Complexity of Neural Network Loading 

Neural networks establish an important class of learning models that are widely 
applied in practical applications to solving artificial intelligence tasks [13]. The 
most prominent position among successful neural learning heuristics is occupied 
by the back-propagation algorithm m which is often used for training feedfor- 
ward networks. This algorithm is based on the gradient descent method that 
minimizes the quadratic regression error of a network with respect to a training 
data. For this purpose, each unit (neuron) in the network applies a differentiable 
activation function (e.g. the standard logistic sigmoid) to the weighted sum of 
its local inputs rather than the discrete Heaviside (threshold) function with bi- 
nary outputs. However, the underlying optimization process appears very time 
consuming even for small networks and training tasks. This was confirmed by an 
empirical study of the learning time required by the back-propagation algorithm 
which suggested its exponential scaling with the size of training sets and net- 
works |3f)j . Its slow convergence is probably caused by the inherent complexity 
of training feedforward networks. 
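
As a small illustration of the setting described above, the following Python sketch performs batch gradient descent on the quadratic error of a single logistic neuron. It is only a sketch under assumptions made here: the function names, the learning rate, and the training data are hypothetical, and it is not the back-propagation implementation studied in the cited work.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def output(w, x):
    """Neuron output for weights w (w[0] is the bias) and input vector x."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def quadratic_error(w, data):
    """Sum over training pairs (x, d) of (output - d)^2."""
    return sum((output(w, x) - d) ** 2 for x, d in data)


def gradient_step(w, data, rate=0.5):
    """One batch gradient-descent step on the quadratic error."""
    grad = [0.0] * len(w)
    for x, d in data:
        y = output(w, x)
        delta = 2.0 * (y - d) * y * (1.0 - y)   # derivative of (y - d)^2 w.r.t. the weighted sum
        grad[0] += delta
        for i, xi in enumerate(x, start=1):
            grad[i] += delta * xi
    return [wi - rate * gi for wi, gi in zip(w, grad)]


# Hypothetical training set: the Boolean OR of two inputs.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0, 0.0]
for _ in range(200):
    w = gradient_step(w, data)
```

On this easy data set the error decreases from its initial value of 1; the hardness results surveyed here concern worst-case training sets, for which no such efficient progress can be guaranteed.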

The first attempt to theoretically analyze the time complexity of learning 
by feedforward networks is due to Judd m who introduced the so-called load- 
ing problem which is the problem of finding the weight parameters for a given 
fixed network architecture and a training task so that the network responses 
are perfectly consistent with all training data. For example, an efficient loading 

* Research supported by grants GA AS CR B2030007 and GA CR No. 201/00/1489. 

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 92-105, 2001.
algorithm is required for the proper PAC learnability (besides the polynomial
VC-dimension that the most common neural network models possess [30|38] ).
However, Judd proved the loading problem for feedforward networks to be NP- 
complete even if very strong restrictions are imposed on their architectures and 
training tasks |21| . The drawback of Judd’s proofs is in using quite unnatural 
network architectures with irregular interconnection patterns and a fixed input 
dimension while the number of outputs grows which do not appear in practice. 
On the other hand, his arguments are valid for practically all the common unit 
types including the sigmoid neurons. Eventually, Judd provided a polynomial- 
time loading algorithm for restricted shallow architectures m whose practical 
applicability was probably ruled out by the hardness result for loading deep 
networks m- Further, Parberry proved a similar NP-completeness result for 
loading feedforward networks with irregular interconnections and only a small 
constant number of units m- In addition, Wiklicky showed that the loading 
problem for higher-order networks with integer weights is even algorithmically 
not solvable |3H]. 

In order to achieve the hardness results for common layered architectures 
with complete connectivity between neighbor layers, Blum and Rivest in their 
seminal work considered the smallest conceivable two-layer network with only 
3 binary neurons (two hidden and one output units) employing the Heaviside 
activation function. They proved the loading problem for such a 3-node network 
with n inputs to be NP-complete and generalized the proof for a polynomial 
number of hidden units (in terms of n) when the output neuron computes log- 
ical AND |S]. Hammer further replaced the output AND gate by a threshold 
unit HD while Kuhlmann achieved the proof for the output unit implement- 
ing any subclass of Boolean functions depending on all the outputs from hidden 
nodes [23] ■ Lin and Vitter extended the NP-completeness result even for a 2-node 
cascade architecture with one hidden unit connected to the output neuron that 
also receives the inputs [21]. Megiddo, on the other hand, showed that the load- 
ing problem for two-layer networks with a fixed number of real inputs and the 
Heaviside hidden nodes, and the output unit implementing an arbitrary Boolean 
function is solvable in polynomial time |2ti) . 

Much effort has been spent to generalize the hardness results also for con- 
tinuous activation functions, especially for the standard sigmoid used in the 
back-propagation heuristics for which the loading problem is probably at least 
algorithmically solvable [2S|. DasGupta et al. proved that loading a 3-node net- 
work whose two hidden units employ the continuous saturated-linear activa- 
tion function while the output neuron applies the threshold function for di- 
chotomic classification purposes is NP-complete [5|. Further, Hoffgen showed 
the NP-completeness of loading a 3-node network employing the standard acti- 
vation function for exact interpolation but with the severe restriction to binary 
weights ng. A more realistic setting as concerns the back-propagation learning 
was first considered in [33] where loading a 3-node network with two standard 
sigmoid hidden neurons was proved to be NP-hard, although an additional constraint
on the weights of the output threshold unit used for binary classification
was assumed, which is satisfied, e.g., when the output bias is zero. Hammer replaced
this constraint by requiring the output unit with bounded weights to
respond with outputs that are in absolute value greater than a given accuracy 
which excludes a small output interval around zero from the binary classifica- 
tion [1^ . This approach also allows to generalize the hardness result for a more 
general class of activation functions than just the standard sigmoid. On the other 
hand, there exist activation functions that have still appropriate mathematical 
properties and for which the feedforward networks are always loadable m- 

Furthermore, the loading problem assumes the correct classification of all 
training data while in practice one is typically satisfied by the weights yielding 
a small training error. Therefore, the complexity of approximately interpolat- 
ing a training set with in general real outputs by feedforward neural networks 
has further been studied. Jones considered a 3-node network with n inputs, 
two hidden neurons employing any monotone Lipschitzian sigmoidal activation 
function (e.g. the standard sigmoid) and one linear output unit with bounded 
weights | 19| . For such a 3-node network he proved that learning the patterns 
with real outputs from [0, 1] each within a small absolute error 0<e<l/10 
is NP-hard implying that the problem of finding the weights that minimize the 
quadratic regression error within a fixed e of its infimum (or absolutely) is also 
NP-hard. This NP-hardness proof was generalized for polynomial number k of 
hidden neurons and a convex linear output unit (with zero bias and nonnegative 
weights whose sum is 1) when the total quadratic error is required to be within 
l/(16fc®) of its infimum (or within l/(4fc^) for the Heaviside hidden units) [1 9j . 

In addition, Vu found the relative error bounds (with respect to the error infi- 
mum) for hard approximate interpolation which are independent on the training 
set size p by considering the average quadratic error that is defined as the total 
error divided by p. In particular, he proved that it is NP-hard to find weights of 
a two-layer network with n inputs, k hidden sigmoid neurons (satisfying some 
Lipschitzian conditions) and one linear output unit with zero bias and positive 
weights such that for a given training data the relative average quadratic error 
is within a fixed bound of order 0{l/{nk^)) of its infimum 1371 . Moreover, for 
two-layer networks with k hidden neurons employing the Heaviside activations 
and one sigmoid (or threshold) output unit, Bartlett and Ben-David improved 
this bound to 0{l/k^) which is even independent on the input dimension jl]. In 
the case of the threshold output unit used for classification, DasGupta and Ham- 
mer proved the same relative error bound 0(1/ k^) on the fraction of correctly 
classified training patterns which is NP-hard to achieve for training sets of size 
^(,3.5 < p < related to the number k of hidden units j7]. They also showed that 
it is NP-hard to approximate this success ratio within a relative error smaller 
than 1/2244 for two-layer networks with n inputs, two hidden sigmoid neurons 
and one output threshold unit (with bounded weights) exploited for the classifi- 
cation with an accuracy 0 < e < 0.5. On the other hand, minimizing the ratio of 
the number of misclassified training patterns within every constant larger than 
1 for feedforward threshold networks with zero biases in the first hidden layer is 
NP-hard 0. 

The preceding results suggest that training feedforward networks with fixed
architectures is indeed hard. However, a possible way out of this situation
might be constructive learning algorithms that adapt the network architecture
to a particular training task. It is conjectured that for successful
generalization the network size should be kept small; otherwise a training set can
easily be wired into the network implementing a look-up table [34]. A constructive
learning algorithm usually requires an efficient procedure for minimizing the
training error by adapting the weights of only a single unit that is being added to
the architecture while the weights of remaining units in the network are already
fixed (e.g. m)- Clearly, for a single binary neuron employing the Heaviside acti- 
vation function the weights that are consistent with a given training data can be 
found in polynomial time by linear programming provided that they exist (al- 
though this problem restricted to binary weights is NP-complete |2S! and also to 
decide whether the Heaviside unit can implement a Boolean function given in a 
disjunctive or conjunctive normal form is co-NP-complete m)- Such weights do 
not often exist but a good approximate solution would be sufficient for construc- 
tive learning. However, several authors provided NP-completeness proofs for the 
problem of finding the weights for a single Heaviside unit so that the number of 
misclassified training patterns is at most a given constant [TTTEITI which remains 
NP-complete even if the bias is assumed to be zero \m- In addition, this issue 
is also NP-hard for a fixed error that is a constant multiple of the optimum . 

Hush further generalized these results for a single sigmoid neuron by showing 
that it is NP-hard to minimize the training error under the L\ norm strictly 
within 1 of its infimum m- He conjectured that a similar result holds for 
the quadratic error corresponding to the $L_2$ norm which is used in
back-propagation learning. In the present paper this conjecture is proved. In
particular, it will be shown that the issue of deciding whether there exist weights of
a single neuron employing the standard activation function so that the total
quadratic error with respect to a training data is at most a given constant is
NP-hard. The presented proof also provides an argument that the problem of
finding the weights that minimize the relative quadratic training error within
1 or its average within 13/(31n) of its infimum is NP-hard. This implies that
the popular back-propagation learning algorithm may not be efficient even for a
single neuron and thus has negative consequences for constructive learning. For
simplicity, we will consider only the standard sigmoid in this paper, while in
the full version we plan to reformulate the theorem for a more general class of
sigmoid activation functions.

2 Training a Standard Sigmoid Neuron 

In this section the basic definitions regarding a sigmoid neuron and its
training will be reviewed. A single (perceptron) unit (neuron) with n real inputs
$x_1, \ldots, x_n \in \mathbb{R}$ first computes its real excitation

\[ \xi = w_0 + \sum_{i=1}^{n} w_i x_i \tag{1} \]

where $w = (w_0, \ldots, w_n) \in \mathbb{R}^{n+1}$ is the corresponding real weight vector
including a bias $w_0$. The output $y$ is then determined by applying a nonlinear activation
function $\sigma$ to its excitation:

\[ y = \sigma(\xi) . \tag{2} \]

We fix $\sigma$ to be the standard (logistic) sigmoid:

\[ \sigma(\xi) = \frac{1}{1 + e^{-\xi}} \tag{3} \]

which is employed in the widely used back-propagation learning heuristics.
Correspondingly, we call such a neuron the standard sigmoid unit.

Furthermore, a training set

\[ T = \{ (x_k, d_k) ;\ x_k = (x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^n,\ d_k \in [0, 1],\ k = 1, \ldots, p \} \tag{4} \]

is introduced containing p pairs (training patterns), each composed of an
n-dimensional real input $x_k$ and the corresponding desired scalar output value
$d_k$ from [0, 1] to be consistent with the range of activation function (3). Given a
weight vector $w$, the quadratic training error

\[ E_T(w) = \sum_{k=1}^{p} \bigl( y(w, x_k) - d_k \bigr)^2 = \sum_{k=1}^{p} \Bigl( \sigma\Bigl( w_0 + \sum_{i=1}^{n} w_i x_{ki} \Bigr) - d_k \Bigr)^2 \tag{5} \]

of a neuron with respect to the training set $T$ is defined as the difference between
the actual outputs $y(w, x_k)$ depending on the current weights $w$ and the desired
outputs $d_k$ over all training patterns $k = 1, \ldots, p$ measured by the $L_2$ regression
norm. The main goal of learning is to minimize the training error $E_T(w)$ in the
weight space. The decision version for the problem of minimizing the error of
a neuron employing the standard sigmoid activation function with respect to a
given training set is formulated as follows:

Minimum Sigmoid-Unit Error (MSUE)

Instance: A training set $T$ and a positive real number $\varepsilon > 0$.
Question: Is there a weight vector $w \in \mathbb{R}^{n+1}$ such that $E_T(w) \le \varepsilon$?
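
The quantities in (3) and (5) and the MSUE question for one given weight vector can be sketched as follows. This is only an illustration under stated assumptions: the function names and the two-pattern training set are hypothetical and do not come from the paper, and the sketch merely evaluates a candidate $w$; it does not search the weight space, which is the hard part.

```python
import math

def sigmoid(xi):
    """Standard logistic sigmoid, as in (3)."""
    return 1.0 / (1.0 + math.exp(-xi))

def quadratic_error(w, training_set):
    """Quadratic training error E_T(w) from (5).

    w            -- weights (w_0, w_1, ..., w_n), where w_0 is the bias
    training_set -- list of pairs (x_k, d_k) with x_k a length-n sequence
    """
    total = 0.0
    for x, d in training_set:
        excitation = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        total += (sigmoid(excitation) - d) ** 2
    return total

def satisfies_msue(w, training_set, eps):
    """True iff this particular w witnesses a 'yes' answer for (T, eps)."""
    return quadratic_error(w, training_set) <= eps

# Hypothetical toy instance with n = 2 inputs and p = 2 patterns.
T = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)]
w = (0.0, 2.0, -2.0)
error = quadratic_error(w, T)
```

Checking a single weight vector takes time linear in the size of the training set; the hardness result concerns deciding whether any such vector exists.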



3 Minimizing the Training Error Is Hard 

In this section the main result that training even a single standard sigmoid 
neuron is hard will be proved: 

Theorem 1. The problem MSUE is NP-hard. 

Proof. In order to achieve the NP-hardness result, a known NP-complete prob- 
lem will be reduced to the MSUE problem in polynomial time. In particular, 
the following Feedback Arc Set problem is employed which is known to be NP- 
complete p7| : 

Feedback Arc Set (FAS)

Instance: A directed graph $G = (V, A)$ and a positive integer $a \le |A|$.

Question: Is there a subset $A' \subseteq A$ containing at most $a \ge |A'|$ directed edges
such that the graph $G' = (V, A \setminus A')$ is acyclic?
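
For intuition, the FAS question can be answered on very small graphs by exhaustive search. The sketch below is an illustration only: the function names and the example graph are assumptions, and the search is exponential in the number of edges, in line with the NP-completeness of the problem.

```python
from itertools import combinations

def is_acyclic(vertices, edges):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return removed == len(vertices)

def feedback_arc_set(vertices, edges, a):
    """Return some A' with |A'| <= a whose removal leaves an acyclic graph, or None."""
    edges = list(edges)
    for size in range(a + 1):
        for candidate in combinations(edges, size):
            remaining = [e for e in edges if e not in candidate]
            if is_acyclic(vertices, remaining):
                return set(candidate)
    return None

# Hypothetical example: two cycles, 1->2->3->1 and 1->2->1, share the edge (1, 2).
V = [1, 2, 3]
A = [(1, 2), (2, 3), (3, 1), (2, 1)]
```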

The FAS problem was also exploited for a corresponding result concerning the 
Heaviside unit |2^. However, the reduction is adapted here for the standard 
sigmoid activation function and its verification substantially differs. 

Given a FAS instance $G = (V, A)$, $a$, a corresponding graph $G_r = (V_r, A_r)$ is
first constructed so that every directed edge $(u, v) \in A$ in $G$ is replaced by five
parallel oriented paths

\[ P_{(u,v),h} = \{ (u, u_v), (u_v, u_{vh1}), (u_{vh1}, u_{vh2}), \ldots, (u_{vh,r-1}, v) \} \tag{6} \]

for $h = 1, \ldots, 5$ in $G_r$ sharing only their first edge $(u, u_v)$ and vertices $u, u_v, v$.
Each path $P_{(u,v),h}$ includes

\[ r = 8a + 6 \tag{7} \]

additional vertices $u_v, u_{vh1}, u_{vh2}, \ldots, u_{vh,r-1}$ unique to $(u,v) \in A$, i.e. the
subsets of edges

\[ A_{(u,v)} = \bigcup_{h=1}^{5} P_{(u,v),h} \tag{8} \]

corresponding to different $(u,v) \in A$ are pairwise disjoint. Thus,

\[ V_r = V \cup \{ u_v ;\ (u,v) \in A \} \cup \{ u_{vh1}, u_{vh2}, \ldots, u_{vh,r-1} ;\ (u,v) \in A,\ h = 1, \ldots, 5 \} \tag{9} \]

\[ A_r = \bigcup_{(u,v) \in A} A_{(u,v)} . \tag{10} \]

It follows that $n = |V_r| = |V| + (5r - 4)|A|$ and $s = |A_r| = (5r + 1)|A|$. Obviously,
the FAS instance $G$, $a$ has a solution iff the FAS problem is solvable for $G_r$, $a$. The
graph $G_r$ is then exploited for constructing the corresponding MSUE instance
with a training set $T(G)$ for the standard sigmoid unit with
$n = (40a + 26)|A| + |V| = O(|A|^2 + |V|)$ inputs:

\[ T(G) = \{ (x_{(i,j)}, 1), (-x_{(i,j)}, 0) ;\ (i,j) \in A_r,\ x_{(i,j)} = (x_{(i,j),1}, \ldots, x_{(i,j),n}) \in \{-1, 0, 1\}^n \} \tag{11} \]

that contains $p = 2s = (80a + 62)|A| = O(|A|^2)$ training patterns, for each edge
$(i,j) \in A_r$ one pair $(x_{(i,j)}, 1)$, $(-x_{(i,j)}, 0)$ such that

\[ x_{(i,j),\ell} = \begin{cases} -1 & \text{for } \ell = i \\ 1 & \text{for } \ell = j \\ 0 & \text{for } \ell \neq i, j \end{cases} \qquad \ell = 1, \ldots, n, \quad (i,j) \in A_r . \tag{12} \]

In addition, the error in the MSUE instance is required to be at most

\[ \varepsilon = 2a + 1 . \tag{13} \]
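
The construction (6)-(13) can be sketched in code as follows. This is an illustrative reading of the reduction, not the author's implementation: the function name, the tuple labels `("hub", u, v)` and `("path", u, v, h, t)` standing for $u_v$ and $u_{vht}$, and the ordering of the input coordinates are all assumptions made for the example.

```python
def build_msue_instance(vertices, edges, a):
    """Build (V_r, A_r, T(G), eps) from a FAS instance, following (6)-(13)."""
    r = 8 * a + 6                                   # (7)
    reduced_vertices = list(vertices)
    reduced_edges = []
    for (u, v) in edges:
        hub = ("hub", u, v)                         # plays the role of u_v
        reduced_vertices.append(hub)
        reduced_edges.append((u, hub))              # first edge, shared by all five paths
        for h in range(1, 6):
            previous = hub
            for t in range(1, r):                   # u_{vh1}, ..., u_{vh,r-1}
                node = ("path", u, v, h, t)
                reduced_vertices.append(node)
                reduced_edges.append((previous, node))
                previous = node
            reduced_edges.append((previous, v))     # last edge of the path enters v
    index = {node: k for k, node in enumerate(reduced_vertices)}
    n = len(reduced_vertices)
    training_set = []
    for (i, j) in reduced_edges:
        x = [0] * n
        x[index[i]] = -1                            # (12): -1 at the tail of the edge
        x[index[j]] = 1                             # (12): +1 at the head of the edge
        training_set.append((tuple(x), 1))
        training_set.append((tuple(-c for c in x), 0))
    eps = 2 * a + 1                                 # (13)
    return reduced_vertices, reduced_edges, training_set, eps

# Hypothetical smallest example: a single edge and a = 1.
V, A, a = [1, 2], [(1, 2)], 1
Vr, Ar, T, eps = build_msue_instance(V, A, a)
```

The sizes of the outputs can be compared against the counts stated in the text, $n = |V| + (40a + 26)|A|$, $s = (40a + 31)|A|$ and $p = (80a + 62)|A|$.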

Clearly, the present construction of the corresponding MSUE instance can be
achieved in polynomial time in terms of the size of the original FAS instance.

Now, the correctness of the reduction will be verified, i.e. it will be shown
that the MSUE instance has a solution iff the corresponding FAS instance is
solvable. So first assume that there exists a weight vector $w \in \mathbb{R}^{n+1}$ such that

\[ E_{T(G)}(w) \le \varepsilon . \tag{14} \]

Define a subset of edges

\[ A' = \{ (u,v) \in A ;\ w_u \ge w_v \} \subseteq A \tag{15} \]

in $G$. First observe that graph $G' = (V, A \setminus A')$ is acyclic since each vertex
$u \in V \subseteq V_r$ is evaluated by a real weight $w_u \in \mathbb{R}$ so that any directed edge
$(u,v) \in A \setminus A'$ in $G'$ satisfies $w_u < w_v$.

Moreover, it must be checked that $|A'| \le a$. For this purpose, the error
$E_{T(G)}(w)$ introduced in (5) is expressed for the training set $T(G)$ by using (11)
and (12) as follows:

\[ \begin{aligned} E_{T(G)}(w) &= \sum_{(i,j) \in A_r} \bigl( \sigma(w_0 - w_i + w_j) - 1 \bigr)^2 + \sum_{(i,j) \in A_r} \sigma^2(w_0 + w_i - w_j) \\ &= \sum_{(u,v) \in A} \ \sum_{(i,j) \in A_{(u,v)}} \bigl( \sigma^2(-w_0 + w_i - w_j) + \sigma^2(w_0 + w_i - w_j) \bigr) \end{aligned} \tag{16} \]

where the property $\sigma(-\xi) = 1 - \sigma(\xi)$ of the standard sigmoid (3) is employed.
This error is lower bounded by considering only the edges from $A' \subseteq A$:

\[ E_{T(G)}(w) \ge \sum_{(u,v) \in A'} E_{A_{(u,v)}} \tag{17} \]

where each term $E_{A_{(u,v)}}$ for $(u,v) \in A'$ will below be proved to satisfy

\[ E_{A_{(u,v)}} = \sum_{(i,j) \in A_{(u,v)}} \bigl( \sigma^2(-w_0 + w_i - w_j) + \sigma^2(w_0 + w_i - w_j) \bigr) > \frac{\varepsilon}{a+1} . \tag{18} \]

Clearly, e.g. $w_0 \ge 0$ can here be assumed without loss of generality. For each
$(u,v) \in A'$ let $P_{(u,v)}$ be a path with the minimum error

\[ E_{P_{(u,v)}} = \sum_{(i,j) \in P_{(u,v)}} \bigl( \sigma^2(-w_0 + w_i - w_j) + \sigma^2(w_0 + w_i - w_j) \bigr) \tag{19} \]

among paths $P_{(u,v),h}$ for $h = 1, \ldots, 5$. Furthermore, sort the edges $(i,j) \in P_{(u,v)}$
with respect to associated decrements $w_i - w_j$ in nonincreasing order and denote
by $(c,d), (e,f) \in P_{(u,v)}$ the first two edges, respectively, in the underlying sorted
sequence, i.e. $w_c - w_d \ge w_e - w_f \ge w_i - w_j$ for all $(i,j) \in P_{(u,v)} \setminus \{(c,d), (e,f)\}$.

First consider the case when $w_0 + w_e - w_f \ge \ln 2$, i.e. $\sigma^2(w_0 + w_e - w_f) \ge 4/9$
according to (3). It follows from the definition of $P_{(u,v)}$ and (8) that

\[ \begin{aligned} E_{A_{(u,v)}} &= \sum_{(i,j) \in A_{(u,v)}} \bigl( \sigma^2(-w_0 + w_i - w_j) + \sigma^2(w_0 + w_i - w_j) \bigr) \\ &\ge 5 \cdot \sum_{(i,j) \in P_{(u,v)} \setminus \{(u,u_v)\}} \bigl( \sigma^2(-w_0 + w_i - w_j) + \sigma^2(w_0 + w_i - w_j) \bigr) \\ &\ge 5 \cdot \sigma^2(w_0 + w_e - w_f) \ge \frac{20}{9} > \frac{\varepsilon}{a+1} \end{aligned} \tag{20} \]

since $P_{(u,v)} \setminus \{(u,u_v)\}$ contains an edge $(i,j) \in \{(c,d), (e,f)\}$ with
$\sigma^2(w_0 + w_i - w_j) \ge \sigma^2(w_0 + w_e - w_f)$ by definition of $(e,f)$ because $\sigma^2$ is increasing. This
proves inequality (18) for $w_0 + w_e - w_f \ge \ln 2$.

On the other hand suppose that $w_0 + w_e - w_f < \ln 2$. In this case vertices
$i \in V_r$ on path $P_{(u,v)}$ ($(u,v) \in A'$) will possibly be re-labeled with new weights
$w'_i \in \mathbb{R}$ except for fixed $w_u, w_v$ so that there is at most one edge $(c,d) \in P_{(u,v)}$ with
a positive decrement $w'_c - w'_d > 0$ or all the edges $(i,j) \in P_{(u,v)}$ are associated
with nonnegative decrements $w'_i - w'_j \ge 0$ while the error $E_{P_{(u,v)}}$ introduced
in (19) is not increased. Note that error $E_{P_{(u,v)}}$ depends only on decrements
$w_i - w_j$ rather than on the actual weights $w_i, w_j$. For example, these decrements
can arbitrarily be permuted along path $P_{(u,v)}$ producing new weights whereas
$E_{P_{(u,v)}}$ and $w_u, w_v$ do not change. Recall from the definition of $(c,d), (e,f)$ that for
all $(i,j) \in P_{(u,v)} \setminus \{(c,d)\}$ it holds

\[ -w_0 + w_i - w_j \le w_0 + w_i - w_j \le w_0 + w_e - w_f < \ln 2 . \tag{21} \]

Now, suppose that there exists an edge $(i,j) \in P_{(u,v)} \setminus \{(c,d)\}$ with a positive
decrement $0 < w_i - w_j \le w_c - w_d$ together with an edge $(\ell, m) \in P_{(u,v)}$ associated
with a negative decrement $w_\ell - w_m < 0$. Then these decrements are updated as
follows:

\[ w'_i - w'_j = w_i - w_j - \Delta \tag{22} \]

\[ w'_\ell - w'_m = w_\ell - w_m + \Delta \tag{23} \]

where $\Delta = \min(w_i - w_j, w_m - w_\ell) > 0$. This can be achieved e.g. by permuting
the decrements along path $P_{(u,v)}$ so that $w_i - w_j$ follows immediately after
$w_\ell - w_m$ (this produces new weights but preserves $E_{P_{(u,v)}}$) and by decreasing
the weight of the middle vertex that is common to both decrements by $\Delta$, which
clearly influences error $E_{P_{(u,v)}}$. However, for $\xi < \ln 2$ the first derivative $(\sigma^2)'$ is
increasing because

\[ (\sigma^2)''(\xi) = \frac{2e^{-\xi} \left( 2e^{-\xi} - 1 \right)}{\left( 1 + e^{-\xi} \right)^4} > 0 \tag{24} \]

for $\xi < \ln 2$ according to (3). Hence,

\[ \sigma^2(-w_0 + w_i - w_j) + \sigma^2(-w_0 + w_\ell - w_m) \ge \sigma^2(-w_0 + w_i - w_j - \Delta) + \sigma^2(-w_0 + w_\ell - w_m + \Delta) \tag{25} \]

\[ \sigma^2(w_0 + w_i - w_j) + \sigma^2(w_0 + w_\ell - w_m) \ge \sigma^2(w_0 + w_i - w_j - \Delta) + \sigma^2(w_0 + w_\ell - w_m + \Delta) \tag{26} \]

according to (21). This implies that error $E_{P_{(u,v)}}$ only decreases while $w'_i - w'_j = 0$
or $w'_\ell - w'_m = 0$. By repeating this re-labeling procedure eventually at most
one positive decrement $w_c - w_d > 0$ remains or all the negative decrements are
eliminated.

Furthermore,

\[ w_c - w_d + \sum_{(i,j) \in P_{(u,v)} \setminus \{(c,d)\}} (w_i - w_j) = w_u - w_v \ge 0 \tag{27} \]

due to $(u,v) \in A'$ which implies

\[ w_d - w_c \le \sum_{(i,j) \in P_{(u,v)} \setminus \{(c,d)\}} (w_i - w_j) . \tag{28} \]

Thus,

\[ w_d - w_c \le w_i - w_j \tag{29} \]

can be assumed for all $(i,j) \in P_{(u,v)}$ since the decrements $w_i - w_j$ for
$(i,j) \in P_{(u,v)} \setminus \{(c,d)\}$ in sum (28) can be made all nonpositive or all nonnegative.
According to (29) inequality (18) would follow from

\[ \begin{aligned} E_{A_{(u,v)}} \ge E_{P_{(u,v)}} &\ge \sigma^2(-w_0 + w_c - w_d) + r \cdot \sigma^2(-w_0 + w_d - w_c) \\ &\quad + \sigma^2(w_0 + w_c - w_d) + r \cdot \sigma^2(w_0 + w_d - w_c) > \frac{\varepsilon}{a+1} \end{aligned} \tag{30} \]

because there are $r$ edges $(i,j)$ on path $P_{(u,v)}$ except for $(c,d)$ and $\sigma^2$ is increasing.
The particular terms of addition (30) can suitably be coupled so that it
suffices to show

\[ \sigma^2(\xi) + r \cdot \sigma^2(-\xi) > \frac{\varepsilon}{2(a+1)} \tag{31} \]

for any excitation $\xi \in \mathbb{R}$. For this purpose, a boundary excitation

\[ \xi_b = \ln \frac{\sqrt{\varepsilon}}{\sqrt{2(a+1)} - \sqrt{\varepsilon}} = \ln \left( 2a + 1 + \sqrt{4a^2 + 6a + 2} \right) \tag{32} \]

is derived from (3), (13) such that

\[ \sigma^2(\xi_b) = \frac{\varepsilon}{2(a+1)} . \tag{33} \]

Thus, $\sigma^2(\xi) \ge \sigma^2(\xi_b)$ for $\xi \ge \xi_b$ because $\sigma^2$ is increasing, which clearly implies
(31) for $\xi \ge \xi_b$ according to (33). For $\xi < \xi_b$, on the other hand, it will even be
proved that

\[ \sigma^2(\xi) + r \cdot \sigma^2(-\xi) \ge 1 > \frac{\varepsilon}{2(a+1)} \tag{34} \]

which reduces to

\[ r \ge \frac{1 - \sigma^2(\xi)}{\sigma^2(-\xi)} = 2e^{\xi} + 1 \tag{35} \]

by using (3). Moreover, it is sufficient to verify (35) only for $\xi = \xi_b$, i.e.

\[ r \ge 2e^{\xi_b} + 1 \tag{36} \]

since $2e^{\xi} + 1$ is increasing. Inequality (36) can be checked by substituting (7) for $r$
and (32) for $\xi_b$, which completes the argument for (31) and consequently for (18).
Finally, by introducing (14) and (18) into inequality (17) it follows that

\[ \varepsilon \ge E_{T(G)}(w) > |A'| \cdot \frac{\varepsilon}{a+1} \tag{37} \]

which gives $|A'| < a + 1$ or equivalently $|A'| \le a$. This completes the proof that
$A'$ is a solution of the FAS problem.
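
The key inequality (31), together with the boundary excitation (32) and condition (36), can be spot-checked numerically. The sketch below is a sanity check on a finite grid under assumed names (`sigma_sq`, `boundary_excitation`, `check_inequality_31`) and arbitrarily chosen values of $a$; it is not a substitute for the analytic argument above.

```python
import math

def sigma_sq(xi):
    """Squared standard sigmoid, sigma(xi)**2."""
    return (1.0 / (1.0 + math.exp(-xi))) ** 2

def boundary_excitation(a):
    """xi_b in the closed form of (32)."""
    return math.log(2 * a + 1 + math.sqrt(4 * a * a + 6 * a + 2))

def check_inequality_31(a, grid):
    """True iff sigma^2(xi) + r * sigma^2(-xi) > eps / (2(a+1)) on every grid point."""
    r = 8 * a + 6                      # (7)
    eps = 2 * a + 1                    # (13)
    bound = eps / (2 * (a + 1))
    return all(sigma_sq(xi) + r * sigma_sq(-xi) > bound for xi in grid)

# Excitations from -30 to 30 in steps of 0.1; a few sample values of a.
grid = [k / 10.0 for k in range(-300, 301)]
results = {a: check_inequality_31(a, grid) for a in (1, 2, 5, 20)}
```

For each tested $a$ one can also confirm that `sigma_sq(boundary_excitation(a))` agrees with $\varepsilon / (2(a+1))$ up to rounding, as (33) states, and that $8a + 6 \ge 2e^{\xi_b} + 1$, as (36) requires.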

On the other hand, assume that there exists a solution $A' \subseteq A$ of the FAS
instance containing at most $a \ge |A'|$ directed edges making graph $G' = (V, A \setminus A')$
acyclic. Define a subset

\[ A'_r = \{ (u, u_v) ;\ (u,v) \in A' \} \tag{38} \]

containing $|A'_r| = |A'| \le a$ edges from $A_r$. Clearly, graph $G'_r = (V_r, A_r \setminus A'_r)$
is also acyclic and hence its vertices $i \in V_r$ can be evaluated by integers $w'_i$ so
that any directed edge $(i,j) \in A_r \setminus A'_r$ satisfies $w'_i < w'_j$. Now, the corresponding
weight vector $w$ is defined as

\[ w_i = K \cdot w'_i \tag{39} \]

for $i \in V_r$ where $K > 0$ is a sufficiently large positive constant, e.g.

\[ K = \ln \left( \sqrt{p} - 1 \right) = \ln \left( \sqrt{2s} - 1 \right) \tag{40} \]

(recall $p = |T(G)| = 2s$ where $s = |A_r|$) while $w_0 = 0$, which will be proved to
be a solution for the MSUE instance. The error (16) can be rewritten for $w$:

\[ \begin{aligned} E_{T(G)}(w) &= 2 \sum_{(i,j) \in A_r} \sigma^2(w_i - w_j) \\ &= 2 \sum_{(i,j) \in A'_r} \sigma^2(w_i - w_j) + 2 \sum_{(i,j) \in A_r \setminus A'_r} \sigma^2(w_i - w_j) . \end{aligned} \tag{41} \]

For $(i,j) \in A_r \setminus A'_r$ it holds

\[ w_i - w_j = K (w'_i - w'_j) \le -K < 0 \tag{42} \]

according to (39) where $w'_i - w'_j \le -1$ because $w'_i, w'_j$ are integers. This implies

\[ \sigma^2(w_i - w_j) \le \sigma^2(-K) = \frac{1}{p} \tag{43} \]

for $(i,j) \in A_r \setminus A'_r$ by formulas (3), (40) because $\sigma^2$ is increasing. Hence, the error
(41) can be upper bounded as

\[ E_{T(G)}(w) \le 2 |A'_r| + 2s \, \sigma^2(-K) \le 2a + 1 = \varepsilon \tag{44} \]

by using $|A'_r| \le a$, $\sigma^2(\xi) < 1$, and $|A_r \setminus A'_r| \le s$. Therefore $w$ is a solution of the
MSUE problem. This completes the proof of the theorem. □
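
The second direction of the proof, (38)-(44), can be traced on a small example: starting from a feedback arc set, build integer labels by topological sorting, scale them by $K$ from (40), and evaluate the error (41). The sketch below is illustrative only; the function names, the tuple labels for the auxiliary vertices, and the three-vertex cycle used as input are assumptions, and the integer labels are simply positions in one topological order.

```python
import math

def sigma_sq(xi):
    """Squared standard sigmoid, evaluated in a numerically stable way."""
    if xi >= 0:
        s = 1.0 / (1.0 + math.exp(-xi))
    else:
        e = math.exp(xi)
        s = e / (1.0 + e)
    return s * s

def reduced_edges(edges, a):
    """Edge set A_r: per original edge, five paths of r + 1 edges sharing the first one."""
    r = 8 * a + 6
    result = []
    for (u, v) in edges:
        hub = ("hub", u, v)                      # stands for u_v
        result.append((u, hub))
        for h in range(1, 6):
            previous = hub
            for t in range(1, r):
                node = ("path", u, v, h, t)
                result.append((previous, node))
                previous = node
            result.append((previous, v))
    return result

def topological_labels(nodes, edges):
    """Integers w'_i with w'_i < w'_j for every edge (i, j); raises if a cycle exists."""
    successors = {x: [] for x in nodes}
    indegree = {x: 0 for x in nodes}
    for i, j in edges:
        successors[i].append(j)
        indegree[j] += 1
    ready = [x for x in nodes if indegree[x] == 0]
    label = {}
    while ready:
        i = ready.pop()
        label[i] = len(label)
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    if len(label) != len(nodes):
        raise ValueError("remaining graph is not acyclic")
    return label

def weights_from_fas_solution(edges, a, removed):
    """Weights (39) with w_0 = 0 and the resulting error (41)."""
    all_edges = reduced_edges(edges, a)
    removed_r = {(u, ("hub", u, v)) for (u, v) in removed}       # A'_r from (38)
    kept = [e for e in all_edges if e not in removed_r]
    nodes = {x for e in all_edges for x in e}
    label = topological_labels(nodes, kept)
    p = 2 * len(all_edges)
    K = math.log(math.sqrt(p) - 1.0)                               # (40)
    weights = {i: K * label[i] for i in nodes}
    error = 2.0 * sum(sigma_sq(weights[i] - weights[j]) for i, j in all_edges)
    return weights, error

# Hypothetical example: a directed 3-cycle, broken by removing the edge (3, 1).
A, a, removed = [(1, 2), (2, 3), (3, 1)], 1, [(3, 1)]
weights, error = weights_from_fas_solution(A, a, removed)
```

On this example the computed error stays below $\varepsilon = 2a + 1 = 3$, as (44) predicts.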

The proof of Theorem 1 also provides the NP-hardness result regarding the
relative (average) error bounds:

Corollary 1. Given a training set $T$ containing $p = |T|$ training patterns, it is
NP-hard to find a weight vector $w \in \mathbb{R}^{n+1}$ of the standard sigmoid neuron with
n inputs for which the quadratic error $E_T(w)$ with respect to $T$ is within 1 of
its infimum, or the average quadratic error $E_T(w)/p$ is within $13/(31n)$ of its
infimum.

Proof. Given a FAS instance $G = (V, A)$, $a$, a corresponding MSUE instance
$T(G)$, $\varepsilon$ is constructed according to (11)-(13) in polynomial time. Assume that
a weight vector $w^* \in \mathbb{R}^{n+1}$ could be found such that

\[ E_{T(G)}(w^*) \le \inf_{w \in \mathbb{R}^{n+1}} E_{T(G)}(w) + 1 . \tag{45} \]

The corresponding subset of edges $A^* \subseteq A$ making graph $G^* = (V, A \setminus A^*)$
acyclic can then be read from $w^*$ according to (15). It will be proved in the
following that $|A^*| \le a$ iff the original FAS instance has a solution. This means
that finding the weight vector $w^*$ that satisfies (45) is NP-hard.

It suffices to show that for $|A^*| \ge a + 1$ there is no subset $A' \subseteq A$ such that
$|A'| \le a$ and $G' = (V, A \setminus A')$ is acyclic since the opposite implication is trivial.
On the contrary suppose that such a subset $A'$ exists. It follows from (45), (17),
and (18) that

\[ \inf_{w \in \mathbb{R}^{n+1}} E_{T(G)}(w) \ge E_{T(G)}(w^*) - 1 > |A^*| \cdot \frac{\varepsilon}{a+1} - 1 \ge 2a . \tag{46} \]

On the other hand, a weight vector $w' \in \mathbb{R}^{n+1}$ corresponding to subset $A' \subseteq A$
could be defined by (39) that would lead to an error

\[ E_{T(G)}(w') \le 2a + 2s \, \sigma^2(-K) \tag{47} \]

according to (44). However, from (3) and (46) there exists $K > 0$ such that

\[ 2s \, \sigma^2(-K) < -2a + \inf_{w \in \mathbb{R}^{n+1}} E_{T(G)}(w) \tag{48} \]

which provides a contradiction $E_{T(G)}(w') < \inf_{w \in \mathbb{R}^{n+1}} E_{T(G)}(w)$ by using (47).

Finally, it follows from the underlying reduction that approximating the
average quadratic error $E_T(w)/p$ within $13/(31n)$ of its infimum is also NP-hard
due to $p \le 31n/13$. □

4 Conclusions

The hardness results for loading feedforward networks are completed by the proof
that the approximate training of only a single sigmoid neuron, e.g. by using the
popular back-propagation heuristics, is hard. This suggests that the constructive
learning algorithms that minimize the training error gradually by adapting unit
by unit may also not be efficient. In the full version of the paper we plan to
formulate the conditions for a more general class of sigmoid activation functions
under which the proof still works.

Abstract. This paper concerns the design of a Support Vector Machine
(SVM) appropriate for the learning of Boolean functions. This is
motivated by the need for a more sophisticated algorithm for classification in
discrete attribute spaces. Classification in discrete attribute spaces is
reduced to the problem of learning Boolean functions from examples of their
input/output behavior. Since any Boolean function can be written in
Disjunctive Normal Form (DNF), it can be represented as a weighted linear
sum of all possible conjunctions of Boolean literals. This paper presents a
particular kernel function called the DNF kernel which enables SVMs to
efficiently learn such linear functions in the high-dimensional space whose
coordinates correspond to all possible conjunctions. For a limited form of
DNF consisting of positive Boolean literals, the monotone DNF kernel
is also presented. SVMs employing these kernel functions can perform
the learning in a high-dimensional feature space whose features are
derived from given basic attributes. In addition, it is expected that SVMs’
well-founded capacity control alleviates overfitting. In fact, an empirical
study on learning of randomly generated Boolean functions shows that
the resulting algorithm outperforms C4.5. Furthermore, in comparison
with SVMs employing the Gaussian kernel, it is shown that the DNF kernel
produces accuracy comparable to the best-adjusted Gaussian kernels.



1 Introduction 

In this paper, Support Vector Machines (SVMs) [15,4] are applied to
classification in discrete attribute spaces with the aim of overcoming difficulties involved
in existing algorithms. Classification, which is a primary data mining task, is
learning a function that maps data into one of several predefined classes. In
particular, numerous studies have been made in a specific framework where data are
described by a fixed set of attributes and their discrete values. Since the
framework can be reduced to the learning of Boolean functions, this paper concerns
the design of SVMs appropriate for the learning of Boolean functions.

For classification in discrete attribute spaces, C4. 5 m is one of the most 
widely used learning algorithms. However, several problems causing a decrease 
in accuracy have been pointed out. 

One problem is that its approach to avoiding overfitting is not always effective.
In order to prevent decision trees from fitting the given training data too closely and
decreasing accuracy for unseen data, C4.5 prunes overly complex decision trees.
However, as discussed in the literature [6], this way of avoiding overfitting is
not supported theoretically, and is shown not to be always effective as a practical
heuristic.

Another problem stems from its univariate node-splitting strategy: constructing
a decision tree whose nodes are split by the single attribute most relevant to class
membership at each node. This strategy is based on the assumption that every
attribute constituting decision rules is relevant to class membership and thus
the rules can be obtained by collecting such attributes one by one. However,
the assumption is not always true under the information-theoretic relevancy
measure employed in C4.5. There are cases in which an attribute is not relevant to
class membership by itself though it has high relevancy when other attributes’
values are known. The literature m illustrates that, in the multiplexer family 
of tasks, address bits, which are important attributes for class decision, show 
no relevancy to class membership, and this phenomenon leads to inaccurate 
decision trees. One way to overcome this problem is to use feature construction 
0: creating new features by combining some attributes and splitting nodes by 
the newly created features. However, due to a large number of combinations of 
attributes, it is infeasible to select a feature most relevant at each node. 

To cope with these problems, this paper applies SVMs to classification in 
discrete attribute spaces. SVMs adopt a well-founded approach to overfitting 
avoidance: minimizing a statistical bound of the error rate for unseen data. Fur- 
thermore, SVMs provide a method of efficient learning in a feature space con- 
sisting of a large number of features derived from basic attributes. It is expected 
that these capabilities of SVMs deliver a good performance on classification in 
discrete attribute spaces. 

A characteristic of SVMs is that target functions are not learned directly
in an input space but in a feature space whose features are derived from
basic attributes. For the learning of Boolean functions, it is reasonable to use the
feature space whose features are all possible conjunctions of negated or
non-negated Boolean variables. This is because any Boolean function can be written
in Disjunctive Normal Form (DNF) and thus the function can be represented as
a weighted linear sum of the features as described in Section 4. The linear
function in the feature space is learned by SVMs from examples of its input/output
behavior. Although the learning seems to be computationally infeasible because
of the high dimensionality of the feature space, e.g. $3^d - 1$ for a d-variable Boolean
function, SVMs can perform efficient learning in the high-dimensional feature
space with the help of a kernel function. A kernel function computes the inner
product in a feature space, and its use allows SVMs to deal with an
alternative representation of the target function which does not depend on the
dimension of the feature space. This paper presents a particular kernel
function called the DNF kernel that enables efficient learning in the feature space
consisting of all possible conjunctions. For a limited form of DNF consisting of
positive Boolean literals, the monotone DNF kernel is also presented.

To explore the capabilities of SVMs employing the DNF kernel, experiments
on learning of randomly generated Boolean functions are performed. The
experiments show that the resulting algorithm produces higher accuracy than C4.5
does. Furthermore, in comparison with SVMs employing the Gaussian kernel,
which is a standard choice of kernel, it is shown that the DNF kernel produces
accuracy comparable to the best-adjusted Gaussian kernels.



2 The Learning Task 



In principle, classification in a discrete attribute space can be reduced to the
learning of Boolean functions. Firstly, an n-class classification task is reduced to
n 2-class classification tasks of discriminating each class from the other classes.
Secondly, by assigning a Boolean variable $x_{ik}$ to the proposition $A_i = v_{ik}$ for
each value $v_{ik}$ of an attribute $A_i$ ($1 \le k \le \ell_i$), each 2-class classification task
can be reduced to the learning of Boolean functions $f : \{0,1\}^d \to \{0,1\}$, where
$d = \sum_i \ell_i$.

Furthermore, the learning of Boolean functions can be generally stated as 
binary classification [^: 

given a training set $S \subseteq X \times Y$ and a hypothesis space $H$, where $X$ is an
input space and $Y = \{P, N\}$ a set of class labels,
find $g \in H$ such that $g$ minimizes the generalization error $\mathrm{err}_{\mathcal{D}}(g)$.

The generalization error is the expected misclassification rate for a distribution
$\mathcal{D}$ over $X \times Y$, which is defined as

\[ \mathrm{err}_{\mathcal{D}}(g) \stackrel{\mathrm{def}}{=} \int p(x, y) \, L(g(x), y) \, dx \, dy \]

using zero-one loss

\[ L(g(x), y) = \begin{cases} 1 & g(x) \neq y \\ 0 & \text{otherwise.} \end{cases} \]



From the practical viewpoint, the minimization of the generalization error
cannot be performed because it depends on an impractical assumption that a
probability distribution over $X \times Y$ is known. Therefore, instead of minimizing
it, practical learning machines minimize the following training-set error $\mathrm{err}_S$:

\[ \mathrm{err}_S(g) \stackrel{\mathrm{def}}{=} \frac{1}{|S|} \sum_{(x,y) \in S} L(g(x), y) . \]
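
The zero-one loss and the training-set error translate directly into code. The sketch below is a minimal illustration; the function names, the XOR sample, and the single-attribute hypothesis `g` are hypothetical and chosen only to show a hypothesis with a nonzero training-set error.

```python
def zero_one_loss(predicted, actual):
    """L(g(x), y): 1 on a misclassification, 0 otherwise."""
    return 0 if predicted == actual else 1

def training_set_error(g, sample):
    """err_S(g): fraction of examples (x, y) in S that g misclassifies."""
    return sum(zero_one_loss(g(x), y) for x, y in sample) / len(sample)

# Hypothetical sample: the XOR function of two Boolean variables, labels P / N.
S = [((0, 0), "N"), ((0, 1), "P"), ((1, 0), "P"), ((1, 1), "N")]

def g(x):
    """A hypothesis that looks only at the first variable."""
    return "P" if x[0] == 1 else "N"

err = training_set_error(g, S)
```

Here `g` is correct on (0, 0) and (1, 0) and wrong on the other two examples, so its training-set error is one half.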



However, minimizing the training-set error causes a well-known problem,
overfitting: the risk that learning machines select a hypothesis that fits the training
data well but captures the underlying model poorly. As a result of overfitting,
learning machines produce a high generalization error even though they produce
a low training-set error.

C4.5 avoids overfitting by pruning decision trees, that is, given two trees
with the same training-set error, it prefers the simpler one based on the
assumption that overfitting is caused by overly complex trees. This way of avoiding
overfitting is known as Occam’s razor and is widely used. However, this empirical
wisdom is not supported theoretically, and is shown not to be always effective
as a practical heuristic [6]. There, the author argues that overfitting
arises not because of complexity, but because testing a large number of hypotheses
leads to a high probability of finding a hypothesis that fits training data well
purely by chance.

The cardinality of the hypothesis space of a learning machine is referred to as 
capacity, and preventing overfitting by allowing just the right amount of capacity 
is known as capacity control. As we will see in the next section, SVMs control 
their capacity by minimizing a statistical bound of the generalization error based 
on the statistical learning theory m- Within the limited hypothesis space by 
the minimization, SVMs find a hypothesis with a low training-set error, which 
is also expected to produce a low generalization error. 

Instead of searching for a hypothesis $g \in H$ directly, SVMs search for the
following real-valued function $f$ in order to use continuous optimization
techniques.

\[ g(x) = \begin{cases} P & f(x) \ge 0 \\ N & f(x) < 0 \end{cases} \]

Such a function $f$ is called a discriminant function.

As a notational convenience, SVMs use the set $\{+1, -1\}$ of class labels.
Accordingly, positive examples of a Boolean function are labeled $+1$ and negative
ones $-1$. That is, in the case of the learning of d-variable Boolean functions, $Y$
is set to $\{+1, -1\}$ and $X$ is set to $\{0, 1\}^d$.

3 Support Vector Machines 

This section serves as a brief introduction to the learning principle of SVMs. For
a more complete introduction, consult the literature [15,4].

SVMs learn non-linear discriminant functions in an input space. This is
achieved by learning linear discriminant functions in a high-dimensional feature
space. A feature mapping $\phi$ from the input space to the feature space maps the
training data $S = \{(x_i, y_i)\}_{i=1}^{n}$ into $\phi(S) = \{(\phi(x_i), y_i)\}_{i=1}^{n} = \{(z_i, y_i)\}_{i=1}^{n}$. In
the feature space, SVMs learn a linear discriminant function $f(z) = (w \cdot z) + b$ so
that the hyperplane $f(z) = 0$ separates the positive examples $\{z_i \mid y_i = 1\}$ from
the negative ones $\{z_i \mid y_i = -1\}$. For any hyperplane $f(z) = (w \cdot z) + b = 0$, the
Euclidean distance to the closest point $z^* \in \{z_1, \ldots, z_n\}$ is called the margin of
the hyperplane. If we normalize hyperplanes so that $|f(z^*)| = 1$, then the
margin of the hyperplane is $1 / \|w\|$. Among normalized hyperplanes, SVMs find the
maximal margin hyperplane that has the maximal margin and separates positive
examples from negative ones. Thus, the learning task of SVMs can be stated as
the following convex quadratic programming problem:

\[ \text{minimize} \quad \|w\|^2 \tag{1} \]

\[ \text{subject to} \quad y_i f(z_i) \ge 1 \quad (1 \le i \le n). \tag{2} \]

Fig. 1. The maximal margin hyperplane in a feature space

The choice of the maximal margin hyperplane is justified by Theorem 4.18
in [4].

Theorem 1 ([4]). For any probability distribution $\mathcal{D}$ on $Z \times Y$ and sufficiently
large $n$, if $y_i f(z_i) \ge \gamma \|w\|$ holds for all $(z_i, y_i)$ ($1 \le i \le n$) and $\gamma > 0$, then the
following inequality holds with probability $1 - \delta$:

\[ \mathrm{err}_{\mathcal{D}}(f) \le \frac{2}{n} \left( \frac{64 R^2}{\gamma^2} \log \frac{e n \gamma}{8 R^2} \log \frac{32 n}{\gamma^2} + \log \frac{4}{\delta} \right) \]

where $R$ is the radius of a sphere that contains $Z$.

Because $\gamma \|w\| = 1$ in our context, the minimization of $\|w\|^2$ amounts to the
minimization of the statistical bound of the generalization error.

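According to the optimization theory, the above convex quadratic
programming problem is transformed into the following dual problem:

\[ \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (z_i \cdot z_j) \tag{3} \]

\[ \text{subject to} \quad \alpha_i \ge 0 \ (1 \le i \le n), \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \tag{4} \]

where the parameters $\alpha_i$ are called Lagrange multipliers. It is known that the above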
convex quadratic programming can be solved efficiently W- For a solution 
$\alpha_1^*, \ldots, \alpha_n^*$, the maximal margin hyperplane $f^*(z) = 0$ can be expressed in the
dual representation in terms of these parameters:

\[ f^*(z) = \sum_{i=1}^{n} \alpha_i^* y_i (z_i \cdot z) + b^* \tag{5} \]

\[ b^* = y_s - \sum_{i=1}^{n} \alpha_i^* y_i (z_i \cdot z_s) \quad \text{for some } \alpha_s^* \neq 0 \tag{6} \]

An advantage of using the dual representation is that we can side-step
evaluation of the feature mapping $\phi$, which is infeasible when the dimension of the
feature space is quite high. Notice that, in the dual representation, the feature
mapping $\phi$ appears only in the form of inner products

\[ (z_i \cdot z_j) = (\phi(x_i) \cdot \phi(x_j)) . \]

Therefore, if we have a way of computing the inner product in the feature space
directly as a function of the input points, i.e.

\[ K(x_i, x_j) = (\phi(x_i) \cdot \phi(x_j)) , \]

then we can side-step the computational problem inherent in evaluating the
feature mapping. Such functions $K$ are called kernel functions. The use of kernel
functions makes it possible to map the data implicitly into a high-dimensional
feature space and to find the maximal margin hyperplane in the feature space.
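
The dual representation (5)-(6), with the inner product replaced by a kernel, can be sketched as follows. The example assumes the Lagrange multipliers are already known; it does not solve the quadratic program. All names are hypothetical, the kernel is the plain inner product, and the two-point training set is a toy case whose dual solution $\alpha_1 = \alpha_2 = 1/2$ can be verified by hand.

```python
def linear_kernel(u, v):
    """K(u, v) = (u . v); any other kernel function could be passed instead."""
    return sum(a * b for a, b in zip(u, v))

def dual_bias(alphas, labels, points, kernel):
    """b* from (6), using the first support vector (alpha_s != 0)."""
    s = next(k for k, alpha in enumerate(alphas) if alpha != 0)
    return labels[s] - sum(alpha * y * kernel(x, points[s])
                           for alpha, y, x in zip(alphas, labels, points))

def dual_decision(x, alphas, labels, points, bias, kernel):
    """f*(x) from (5), written in terms of kernel evaluations only."""
    return sum(alpha * y * kernel(xi, x)
               for alpha, y, xi in zip(alphas, labels, points)) + bias

# Toy problem on the real line: +1 at x = 1 and -1 at x = -1.
points = [(1.0,), (-1.0,)]
labels = [1, -1]
alphas = [0.5, 0.5]        # maximizes 2t - 2t^2 along alpha_1 = alpha_2 = t
bias = dual_bias(alphas, labels, points, linear_kernel)
f_pos = dual_decision((1.0,), alphas, labels, points, bias, linear_kernel)
f_neg = dual_decision((-1.0,), alphas, labels, points, bias, linear_kernel)
```

Neither function ever evaluates a feature mapping explicitly, which is the point of the dual representation: swapping `linear_kernel` for another kernel changes the feature space without changing the code.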

The next section considers a feature space appropriate for learning Boolean 
functions and a kernel function for the feature space. 

4 The DNF Kernel 

For the learning of a Boolean function, it is desirable that the feature space be sufficiently pertinent to the function to linearly separate its positive examples from its negative ones. In this section, we consider a feature space whose features are all conjunctions of negated or non-negated Boolean variables. For instance, the following $(3^2 - 1)$-dimensional feature space is used for 2-variable Boolean functions.

$$x_1,\ x_2,\ 1 - x_1,\ 1 - x_2,\ x_1 x_2,\ x_1(1 - x_2),\ (1 - x_1)x_2,\ (1 - x_1)(1 - x_2)$$

The feature space is formally defined as follows. 

Definition 1. Let idx be a bijection from $\{D, 1, 0\}^d$ to $\{0, \ldots, 3^d - 1\}$ such that $\mathrm{idx}(D, \ldots, D) = 0$. For any Boolean variable $x$ and any $p \in \{D, 1, 0\}$, $L(x, p)$ is defined as follows.

$$L(x, p) = \begin{cases} x & p = 1 \\ 1 - x & p = 0 \\ 1 & p = D \end{cases}$$

Definition 2. $\phi(x) = (\phi_1(x), \ldots, \phi_\ell(x))$, where $\ell = 3^d - 1$ and

$$\phi_i(x) = \prod_{j=1}^{d} L(x_j, p_j), \qquad \mathrm{idx}^{-1}(i) = (p_1, \ldots, p_d).$$
In the feature space induced by $\phi$ defined above, this paper considers hyperplanes with the zero threshold. The following proposition shows that such a class of hyperplanes is sufficient for separating the data of any Boolean function.

Proposition 1. For any Boolean function $g(x)$, there exists a hyperplane $f(z) = \sum_{i=1}^{3^d - 1} w_i z_i = 0$ that satisfies the following conditions for any $x \in \{0, 1\}^d$:

$$g(x) = 1 \Leftrightarrow f(\phi(x)) = 1 \quad \text{and} \quad g(x) = 0 \Leftrightarrow f(\phi(x)) = -1.$$

Proof. Let us consider the hyperplane $\sum_{i=1}^{3^d - 1} w_i z_i = 0$ where, writing $u = \mathrm{idx}^{-1}(i)$,

$$w_i = \begin{cases} 1 & u \in \{0, 1\}^d \text{ and } g(u) = 1 \\ -1 & u \in \{0, 1\}^d \text{ and } g(u) = 0 \\ 0 & \text{otherwise.} \end{cases}$$

For this hyperplane, we can show that $w_i \phi_i(x) \ne 0$ iff $i = \mathrm{idx}(x)$, and thus $f(\phi(x)) = w_i \phi_i(x)$ for $i = \mathrm{idx}(x)$.

From this fact, we see that the first condition holds as follows. If $g(x) = 1$ then $w_i = 1$ and $\phi_i(x) = 1$ hold from the definitions of $w_i$ and $\phi_i$. Therefore, $f(\phi(x)) = w_i \phi_i(x) = 1$. Conversely, if $f(\phi(x)) = w_i \phi_i(x) = 1$ then $w_i$ must be 1, and thus $g(x) = 1$.

In the same way, we see that the hyperplane satisfies the second condition. □

According to the construction of hyperplanes above, we obtain the following hyperplane separating the data of the Boolean function $x_1 \vee \overline{x_2}$.

$$f(\phi(x_1, x_2)) = x_1 x_2 + x_1(1 - x_2) - (1 - x_1)x_2 + (1 - x_1)(1 - x_2) = 0$$

As the construction shows, the conjunctions with length smaller than $d$ are not necessarily required for separability. Such features are required for the generalization ability of hyperplanes. For example, in a feature space consisting of conjunctions with length $d$, the following hyperplane has the maximal margin for the training data $\{((1, 1), 1), ((0, 1), -1), ((0, 0), 1)\}$.

$$f^*(\phi(x_1, x_2)) = x_1 x_2 - (1 - x_1)x_2 + (1 - x_1)(1 - x_2) = 0$$

However, this hyperplane cannot classify $x' = (1, 0)$ because $f^*(\phi(x')) = 0$. That is, SVMs using this feature space are rote learners. On the other hand, in the feature space consisting of all possible conjunctions, the following hyperplane has the maximal margin for the same data.

$$f^*(\phi(x_1, x_2)) = \alpha_3 x_1 + (\alpha_1 + \alpha_2)(1 - x_1) + (\alpha_2 + \alpha_3)x_2 + \alpha_1(1 - x_2) + \alpha_2(1 - x_1)x_2 + \alpha_1(1 - x_1)(1 - x_2) + \alpha_3 x_1 x_2 = 0,$$

where $\alpha_1 \approx 0.57$, $\alpha_2 \approx -0.71$, $\alpha_3 \approx 0.57$. Note that this hyperplane gains generalization ability, i.e. $f^*(\phi(x')) \approx 1$.
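The three-point example can be checked numerically with the DNF kernel $K(u, v) = -1 + \prod_j (2u_jv_j - u_j - v_j + 2)$ from this section. The sketch below is an illustration, not the author's code. It assumes that the signed coefficients are exactly 4/7, -5/7 and 4/7 (the values that put all three training points on the margin, and that round to the quoted 0.57 and -0.71), and that they belong to the points (0, 0), (0, 1) and (1, 1) in that order. The function names are made up for the example.

```python
from fractions import Fraction

def dnf_kernel(u, v):
    """DNF kernel: -1 + prod_j (2*u_j*v_j - u_j - v_j + 2)."""
    prod = 1
    for uj, vj in zip(u, v):
        prod *= 2 * uj * vj - uj - vj + 2
    return prod - 1

# Training data, ordered to match alpha_1, alpha_2, alpha_3 (assumed ordering).
train = [((0, 0), 1), ((0, 1), -1), ((1, 1), 1)]

# Signed coefficients with the label folded in (assumed exact values).
alpha = [Fraction(4, 7), Fraction(-5, 7), Fraction(4, 7)]

def f_star(x):
    """Zero-threshold dual form: sum_i alpha_i * K(x_i, x)."""
    return sum(a * dnf_kernel(xi, x) for a, (xi, _) in zip(alpha, train))

for xi, yi in train:
    print(xi, yi, f_star(xi))   # each training point evaluates to its label
print((1, 0), f_star((1, 0)))   # the unseen point x' = (1, 0)
```

Under these assumptions the unseen point evaluates to 8/7, which agrees with the statement that $f^*(\phi(x'))$ is approximately 1.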
Because the above representation of hyperplanes has $3^d - 1$ terms for $d$-variable Boolean functions, it is difficult to deal with the representation directly from a computational viewpoint. As we have seen in the previous section, the use of kernel functions in the dual representation resolves the difficulty. In the following, we consider a particular kernel function for the feature space induced by the feature mapping $\phi$.

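Definition 3 (The DNF Kernel).

$$K(u, v) = -1 + \prod_{j=1}^{d} (2u_j v_j - u_j - v_j + 2).$$

The following theorem says that $K$ is a kernel function.

Theorem 2. $(\phi(u) \cdot \phi(v)) = K(u, v)$.

Proof.

$$\begin{aligned}
(\phi(u) \cdot \phi(v)) + 1 &= \sum_{i=1}^{3^d - 1} \phi_i(u)\phi_i(v) + 1 \\
&= \sum_{(p_1, \ldots, p_d) \in \{D, 1, 0\}^d} \prod_{j=1}^{d} L(u_j, p_j) \prod_{j=1}^{d} L(v_j, p_j) = \sum_{(p_1, \ldots, p_d)} \prod_{j=1}^{d} L(u_j, p_j) L(v_j, p_j) \\
&= \{L(u_1, 1)L(v_1, 1) + L(u_1, 0)L(v_1, 0) + L(u_1, D)L(v_1, D)\} \cdot \sum_{(p_2, \ldots, p_d)} \prod_{j=2}^{d} L(u_j, p_j) L(v_j, p_j) \\
&= (2u_1 v_1 - u_1 - v_1 + 2) \cdot \sum_{(p_2, \ldots, p_d)} \prod_{j=2}^{d} L(u_j, p_j) L(v_j, p_j) \\
&= \prod_{j=1}^{d} (2u_j v_j - u_j - v_j + 2) = K(u, v) + 1
\end{aligned}$$

□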
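Theorem 2 can also be checked by brute force for small $d$. The following sketch builds the explicit feature map of Definition 2 by enumerating all patterns in $\{D, 1, 0\}^d$ except $(D, \ldots, D)$ and compares the inner product with the closed form. The enumeration order is an arbitrary choice standing in for idx, and the helper names are illustrative only.

```python
from itertools import product

def literal(x, p):
    """L(x, p): x for p = 1, 1 - x for p = 0, and 1 for the symbol D."""
    if p == 1:
        return x
    if p == 0:
        return 1 - x
    return 1  # p == "D"

def phi(x):
    """Explicit feature map with 3^d - 1 coordinates (pattern (D,...,D) is skipped)."""
    feats = []
    for pattern in product(("D", 1, 0), repeat=len(x)):
        if all(p == "D" for p in pattern):
            continue
        value = 1
        for xj, pj in zip(x, pattern):
            value *= literal(xj, pj)
        feats.append(value)
    return feats

def dnf_kernel(u, v):
    """Closed form of Definition 3."""
    prod = 1
    for uj, vj in zip(u, v):
        prod *= 2 * uj * vj - uj - vj + 2
    return prod - 1

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

d = 3
for u, v in product(product((0, 1), repeat=d), repeat=2):
    assert inner(phi(u), phi(v)) == dnf_kernel(u, v)

# The identity is polynomial, so it also holds away from Boolean points.
assert inner(phi((2, -1, 3)), phi((0, 4, 1))) == dnf_kernel((2, -1, 3), (0, 4, 1))
```

The explicit map costs $3^d - 1$ multiplications per point, whereas the closed form needs only $d$ factors, which is the point of the theorem.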

We should notice that the computational complexity of $K$ depends on the dimension $d$ of the input space and not on the dimension $3^d - 1$ of the feature space. Therefore, the use of the kernel function enables efficient learning in the high-dimensional feature space.

In some applications, negation of Boolean variables is not necessarily needed. We can limit the expressive power by using the feature space whose features are all conjunctions of positive Boolean literals. By considering only $L(x, 1)$ and $L(x, D)$ in the proof of Theorem 2, we easily see that the following function is a kernel function for the feature space.
Definition 4 (The Monotone DNF Kernel).

$$K_m(u, v) = -1 + \prod_{j=1}^{d} (u_j v_j + 1).$$

The DNF kernel and the monotone DNF kernel are applicable to points in $\mathbb{R}^d$. By restricting their domains to $\{0, 1\}^d \times \{0, 1\}^d$, their computation is simplified as follows. For any $u, v \in \{0, 1\}^d$, we denote by $\mathrm{same01}(u, v)$ the number of bits that have the same value in $u$ and $v$. In addition, $\mathrm{same1}(u, v)$ denotes the number of active bits common to both $u$ and $v$.

Proposition 2. For any $u, v \in \{0, 1\}^d$,

$$K(u, v) = -1 + 2^{\mathrm{same01}(u, v)}, \qquad K_m(u, v) = -1 + 2^{\mathrm{same1}(u, v)}.$$

These kernel functions were independently discovered by jS].
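Proposition 2 follows because each factor of the product equals 2 when the two bits agree (or, in the monotone case, when both are 1) and equals 1 otherwise. A short exhaustive check over Boolean vectors, written as a sketch with illustrative function names:

```python
from itertools import product

def dnf_kernel(u, v):
    prod = 1
    for uj, vj in zip(u, v):
        prod *= 2 * uj * vj - uj - vj + 2   # 2 if the bits agree, 1 otherwise
    return prod - 1

def monotone_dnf_kernel(u, v):
    prod = 1
    for uj, vj in zip(u, v):
        prod *= uj * vj + 1                 # 2 if both bits are 1, 1 otherwise
    return prod - 1

def same01(u, v):
    """Number of positions where u and v carry the same bit."""
    return sum(uj == vj for uj, vj in zip(u, v))

def same1(u, v):
    """Number of positions where both u and v are 1."""
    return sum(uj == 1 and vj == 1 for uj, vj in zip(u, v))

d = 4
for u, v in product(product((0, 1), repeat=d), repeat=2):
    assert dnf_kernel(u, v) == 2 ** same01(u, v) - 1
    assert monotone_dnf_kernel(u, v) == 2 ** same1(u, v) - 1
```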

5 Experiments 

To explore the capabilities of the learning method described above, I conducted experiments on learning of random Boolean functions in the same way as in [Q . The experimental system SVM+DNF implements an SVM using the DNF kernel that finds maximal margin hyperplanes with the zero threshold. To find the maximal margin hyperplane with the fixed threshold, it uses the stochastic gradient ascent algorithm described in [4, Table 7.1] as a convex quadratic programming solver.

The random $d$-variable Boolean functions were generated in disjunctive normal form as follows. Each variable was included in a disjunct with probability $a/d$ and negated with probability $1/2$. Therefore, the average length of disjuncts is $a$. The number of disjuncts was set to $2^{a-1}$ so as to produce approximately equal numbers of positive and negative examples. In the following experiments, $d$ was set to 16 and $a$ was set to 8.
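The generation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: the inclusion probability is taken to be $a/d$ (which gives average disjunct length $a$), the negation probability is exposed as a parameter `p_negate` with an assumed default of 0.5, and the number of disjuncts is passed in explicitly. The representation of a disjunct as a dictionary is a choice made for the example.

```python
import random

def random_dnf(d, a, n_disjuncts, p_negate=0.5, rng=random):
    """Return a DNF as a list of disjuncts; each maps a variable index to its required bit."""
    dnf = []
    for _ in range(n_disjuncts):
        disjunct = {}
        for j in range(d):
            if rng.random() < a / d:                  # include variable j
                negated = rng.random() < p_negate     # a negated literal requires x_j = 0
                disjunct[j] = 0 if negated else 1
        dnf.append(disjunct)                          # an empty disjunct is always true
    return dnf

def evaluate(dnf, x):
    """A DNF is true iff some disjunct has all of its literals satisfied."""
    return int(any(all(x[j] == bit for j, bit in term.items()) for term in dnf))

rng = random.Random(0)
d, a = 16, 8
target = random_dnf(d, a, 2 ** (a - 1), rng=rng)
sample = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(1000)]
labels = [evaluate(target, x) for x in sample]
print(sum(labels) / len(labels))   # fraction of positive examples in the sample
```

Drawing the sample with `rng.randint(0, 1)` per bit corresponds to the uniform distribution over $\{0, 1\}^d$ used for the training and test data.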

For each Boolean function, n training data and 1000 test data were inde- 
pendently drawn from the uniform distribution. After a learning algorithm was 
trained using the training data, the misclassification rate of the learning algo- 
rithm was measured for the test data. The misclassification rate was averaged 
across 200 different Boolean functions. 



5.1 Comparison with C4.5 

Figure 2 illustrates the dependency of the misclassification rates on the sample size $n$, where SVM+DNF is compared with C4.5 and the Naive Bayes Classifier (NBC).

The result of the experiments shows that SVM+DNF achieves the highest performance. In comparison with C4.5, SVM+DNF is 7.5% more accurate than C4.5 at the sample size $n = 5000$. Also, we see that SVM+DNF is most accurate even at the small sample sizes. This is remarkable in light of the fact that the capacity of SVM+DNF is larger than that of C4.5, because every decision tree can be represented as a hyperplane with equal separability. In general, the larger the capacity, the higher the risk of overfitting, and overfitting is especially visible at small sample sizes. This tendency is exhibited by the result that C4.5 is less accurate than NBC at the sample sizes $n = 20, 50$. As discussed in [5], the phenomenon can be explained as follows: C4.5 overfits the small samples more strongly than NBC does, while the limited capacity of NBC alleviates overfitting. According to this argument, SVM+DNF has a higher risk of overfitting because it has a larger capacity than C4.5. However, the result shows that SVM+DNF is more accurate than C4.5 even at the small sample sizes. This means that the capacity control of SVM+DNF is effective.

Fig. 2. Comparison with C4.5 and NBC (horizontal axis: the number of training data)

5.2 Comparison with Gaussian Kernels 

To test the effectiveness of the DNF kernel, SVM+DNF is compared with SVMs using Gaussian kernels $K_G(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$.

Figure 3 illustrates the dependency of the misclassification rates on the sample size, where the DNF kernel is compared with Gaussian kernels with different values of $\sigma^2$. From the figure, we see that SVM+DNF has accuracy comparable to SVMs using Gaussian kernels with appropriately adjusted width parameters. This can be explained as follows. Assume that $\mathrm{same01}(x_i, x_j) = c$. Then $K(x_i, x_j) = 2^c - 1$. On the other hand,

$$K_G(x_i, x_j) = e^{(c - d)/\sigma^2} = e^{-d/\sigma^2} \cdot 2^{c \log_2 e / \sigma^2}.$$

Since the constant $e^{-d/\sigma^2}$ is absorbed in the Lagrange multipliers, $K$ and $K_G$ behave similarly provided $\sigma^2$ is appropriately chosen.
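The relation between the two kernels can be made concrete. On Boolean vectors the squared Euclidean distance equals the Hamming distance $d - c$, so with the width $\sigma^2 = \log_2 e$ the Gaussian kernel is exactly $2^{-d}$ times $K + 1$. The sketch below checks this numerically. It assumes the Gaussian kernel has the form $\exp(-\|x_i - x_j\|^2/\sigma^2)$ (a different normalisation would only rescale $\sigma^2$), and the names are illustrative.

```python
import math
from itertools import product

def dnf_kernel(u, v):
    prod = 1
    for uj, vj in zip(u, v):
        prod *= 2 * uj * vj - uj - vj + 2
    return prod - 1

def gaussian_kernel(u, v, sigma2):
    """Assumed form: exp(-||u - v||^2 / sigma^2)."""
    sq_dist = sum((uj - vj) ** 2 for uj, vj in zip(u, v))
    return math.exp(-sq_dist / sigma2)

d = 4
sigma2 = math.log2(math.e)      # chosen so that exp(c / sigma2) equals 2**c
scale = math.exp(-d / sigma2)   # the constant factor, equal to 2**(-d)

for u, v in product(product((0, 1), repeat=d), repeat=2):
    lhs = gaussian_kernel(u, v, sigma2)
    rhs = scale * (dnf_kernel(u, v) + 1)
    assert math.isclose(lhs, rhs)
```

The two kernels still differ by the additive constant $-1$ in $K$, so this shows similar rather than identical behaviour.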

As described above, the DNF kernel suggests a value of the parameter of the Gaussian kernel appropriate for learning Boolean functions. However, the DNF kernel has a more crucial advantage: comprehensibility of its features. That is, for the DNF kernel, the corresponding feature mapping is explicitly defined and each feature $\phi_i$ is considered as a conjunction. This contrasts with the Gaussian kernel; the corresponding feature mapping is defined implicitly and features induced by the kernel are hard to interpret. By virtue of this comprehensibility, we can see that a coefficient $w_i$ of a hyperplane quantifies the importance of the conjunction $\phi_i$ for classification. The author believes that the use of the DNF kernel makes it possible to extract crucial features, and that these features are helpful for obtaining a more accurate or comprehensible classifier.

Fig. 3. Comparison with SVMs using Gaussian kernels

6 Conclusions and Future Work 

This paper considered a feature space appropriate for the learning of Boolean 
functions. The feature space consists of all possible conjunctions of Boolean liter- 
als, and is appropriate in the sense that any Boolean function can be represented 
as a hyperplane in the space. Furthermore, for the feature space, we can develop 
the DNF kernel that computes the inner product without depending on high 
dimensionality of the space. It enables SVMs to efficiently find optimal hyper- 
planes in the high dimensional feature space. Also, for a limited form of DNF 
consisting of positive Boolean literals, the monotone DNF kernel was presented. 
To explore the capabilities of the SVM employing the DNF kernel, experiments 
on the learning of randomly generated Boolean functions were conducted. The 
experiments showed that the resulting learning system outperforms C4.5 and has 
accuracy comparable to SVMs using best adjusted Gaussian kernels. Although 
the experiments are not extensive, the SVM employing the DNF kernel seems 
to have relative superiority over other learning algorithms. Further empirical 
studies to confirm the hypothesis are needed. 

Although the SVMs efficiently find the optimal hyperplane for any given data of Boolean functions, this does not mean that they can learn any Boolean function efficiently, because they may require exponentially many examples in order to attain high accuracy. The learnability of Boolean formulae has been studied extensively [14,7,10,1]. While efficient learnability of DNF is one of the main open problems in learning theory, these studies have shown that even a simple class of formulae is not efficiently learnable. Therefore, it seems to be difficult to learn DNF for large $d$. In that case, we have to limit the class of DNF. It is worth exploring the use of the DNF kernel in order to obtain an appropriate limiting parameter.

From a practical point of view, it is important to make the learning method tolerant of noise. For instance, if a point $x$ appearing more than once in a training set has different class labels due to classification noise, then the training set is not linearly separable. In this case, the optimization problem cannot be solved because the constraint on separability is never satisfied. To cope with this difficulty, one can use soft margin optimization techniques [4] that allow a given amount of violation of the constraint. We should investigate the capabilities of these techniques.

Another interesting direction for future research concerns the comprehensibility of learned classifiers. In some applications, an explanation for the decision made by a learned classifier is as important as the accuracy of the classifier. To obtain the explanation, features induced by the DNF kernel might be helpful. As mentioned in the previous section, a coefficient $w_i$ of a hyperplane quantifies the importance of the conjunction $\phi_i$ for classification. By extracting crucial features, a more accurate or comprehensible classifier might be obtainable. The author believes that it is a meaningful research direction to study an SVM using a logic-related kernel that can learn accurate models and generate an explanation for a specific decision.
Abstract. Random sampling techniques have been developed for combinatorial optimization problems. In this note, we report an application of one of these techniques for training support vector machines (more precisely, primal-form maximal-margin classifiers) that solve two-group classification problems by using hyperplane classifiers. Through this research, we are aiming (I) to design efficient and theoretically guaranteed support vector machine training algorithms, and (II) to develop systematic and efficient methods for finding “outliers”, i.e., examples having an inherent error.



1 Introduction 

This paper proposes a new training algorithm for support vector machines (more precisely, primal-form maximal-margin classifiers) for two-group classification problems. We use one of the random sampling techniques that have been developed and used for combinatorial optimization problems; see, e.g., mm-. Through this research, we are aiming (I) to design efficient and theoretically guaranteed support vector machine training algorithms, and (II) to develop systematic and efficient methods for finding “outliers”, i.e., examples having an inherent error. Our proposed algorithm, though not perfect, is a good step towards the first goal (I). We show, under some hypothesis, that our algorithm terminates within a reasonable¹ number of training steps. For the second goal (II), we propose, though only briefly, an approach based on this random sampling technique.

Since the present form of the support vector machine (SVM for short) was proposed [S], SVMs have been used in various application areas, and their classification power has been investigated in depth from both experimental and theoretical points of view. Also, many algorithms and implementation techniques have been developed for training SVMs efficiently; see, e.g., [Mi]. This is because quadratic programming (QP for short) problems need to be solved for training SVMs (as in the original form) and such a QP problem is, though polynomial-time solvable, not so easy. Among speed-up techniques, those called “subset selection” HD have been used as effective heuristics from the early stage of SVM research. Roughly speaking, subset selection is a technique to speed up SVM training by dividing the original QP problem into small pieces, thereby reducing the size of each QP problem. Well-known subset selection techniques are chunking, decomposition, and sequential minimal optimization (SMO for short). (See [8,13,9] for details.) In particular, SMO has become popular because it outperforms the others in several experiments. Though the performance of these subset selection techniques has been extensively examined, no theoretical guarantee has been given on the efficiency of algorithms based on these techniques. (As far as the authors know, the only positive theoretical results are the convergence (i.e., termination) of some of these algorithms [12,6,11].)

In this paper, we propose a subset selection type algorithm based on a randomized sampling technique developed in the combinatorial optimization community. It solves the SVM training problem by iteratively solving small QP problems for randomly chosen examples. There is a straightforward way to apply the randomized sampling technique to design an SVM training algorithm. But this may not work well for data with many errors. Here we use a geometric interpretation of the SVM training problem [3] and derive an SVM training algorithm for which we can prove much faster convergence. Unfortunately, though, a heavy “book keeping” task is required if we implement this algorithm naturally, and the total running time may become very large despite its good convergence speed. Here we propose an implementation technique to get around this problem and obtain an algorithm with reasonable running time. Our obtained algorithm is not perfect in two points: (i) some hypothesis is needed (so far) to guarantee its convergence speed, and (ii) the obtained algorithm (so far) works only for training SVMs in the primal form, and it is not suitable for the kernel technique. But we think that it is a good starting point towards efficient and theoretically guaranteed algorithms.



¹ By “reasonable bound”, we mean some low polynomial bound w.r.t. $n$, $m$, and $\ell$, where $n$, $m$, and $\ell$ are respectively the number of attributes, the number of examples, and the number of erroneous examples.
2 SVM and Random Sampling Techniques

Here we explain basic notions on SVMs and random sampling techniques. Due to the space limit, we only explain those necessary for our discussion. For SVMs, see, e.g., a good textbook [S], and for random sampling techniques, see, e.g., an excellent survey m-.

For support vector machine formulations, we will consider, in this paper, only binary classification by a hyperplane of the example space; in other words, we regard training an SVM for a given set of labeled examples as the problem of computing a hyperplane separating positive and negative examples with the largest margin.

Suppose that we are given a set of $m$ examples $x_i$, $1 \le i \le m$, in some $n$-dimensional space, say $\mathbb{R}^n$. Each example $x_i$ is labeled by $y_i \in \{1, -1\}$ denoting the classification of the example. The SVM training problem (of the separable case) we will discuss in this paper is essentially to solve the following optimization problem. (Here we follow [3] and use their formulation. But this problem can be restated by using a single threshold parameter as given in [S].)

Max Margin (P1)

min. $\frac{1}{2}\|w\|^2 - (\theta_+ - \theta_-)$
w.r.t. $w = (w_1, \ldots, w_n)$, $\theta_+$, and $\theta_-$,
s.t. $w \cdot x_i \ge \theta_+$ if $y_i = 1$, and
     $w \cdot x_i \le \theta_-$ if $y_i = -1$.



Remark 1. Throughout this note, we use $X$ to denote the set of examples, and let $n$ and $m$ denote the dimension of the example space and the number of examples. Also we use $i$ for indexing examples and their labels, and $x_i$ and $y_i$ to denote the $i$th example and its label. The range of $i$ is always $\{1, \ldots, m\}$.

By the solution of (P1), we mean the hyperplane that achieves the minimum cost. We sometimes consider a partial problem of (P1) that minimizes the target cost under some subset of constraints. A solution to such a partial problem of (P1) is called a local solution of (P1) for the subset of constraints.

We can solve this optimization problem by using a standard general QP (i.e., quadratic programming) solver. Unfortunately, however, such general QP solvers do not scale well. Note, on the other hand, that there are cases where the number $n$ of attributes is relatively small, while $m$ is quite large; that is, the large problem size is due to the large number of examples. This is the situation where randomized sampling techniques are effective.

We first explain intuitively our² random sampling algorithm for solving the problem (P1). The idea is simple. Pick a certain number of examples from $X$ and solve (P1) under the set of constraints corresponding to these examples. We choose examples randomly according to their “weights”, where initially all examples are given the same weight. Clearly, the obtained local solution is, in general, not the global solution, and it does not satisfy some constraints; in other words, some examples are misclassified by the local solution. Then we double the “weight” of such misclassified examples, and pick some examples again randomly according to their weights. If we iterate this process for several rounds, the weight of “important examples”, which are support vectors in our case, increases, and hence they are likely to be chosen. Note that once all support vectors are chosen at some round, the local solution of this round is the real one, and the algorithm terminates at this point. By using the Sampling Lemma, we can prove that the algorithm terminates in $O(n \log m)$ rounds on average. We will give this bound after explaining necessary notions and notations and stating our algorithm.

² This algorithm is not new. It is obtained from a general algorithm given in [10].

We first explain the abstract framework for discussing randomized sampling techniques that was given by Gärtner and Welzl [10]. (Note that the idea of this Sampling Lemma can be found in the paper by Clarkson [7], where a randomized algorithm for linear programming has been proposed. Indeed, a similar idea has been used [1] to design an efficient randomized algorithm for quadratic programming.)

Randomized sampling techniques, in particular the Sampling Lemma, are applicable to many “LP-type” problems. Here we use $(\mathcal{D}, \phi)$ to denote an abstract LP-type problem, where $\mathcal{D}$ is a set of elements and $\phi$ is a function mapping any $\mathcal{R} \subseteq \mathcal{D}$ to some value space. In the case of our problem (P1), for example, we can regard $\mathcal{D}$ as $X$ and define $\phi$ as a mapping from a given subset $R$ of $X$ to the local solution of (P1) for the subset of constraints corresponding to $R$. As an LP-type problem, we require $(\mathcal{D}, \phi)$ to satisfy certain conditions. Here we omit the explanation and simply mention that our example case clearly satisfies these conditions.

For any $\mathcal{R} \subseteq \mathcal{D}$, a basis of $\mathcal{R}$ is an inclusion-minimal subset $B$ of $\mathcal{R}$ such that $\phi(B) = \phi(\mathcal{R})$. The combinatorial dimension of $(\mathcal{D}, \phi)$ is the size of the largest basis of $\mathcal{D}$. We will use $\delta$ to denote the combinatorial dimension. For the problem (P1), the largest basis is the set of all support vectors; hence, the combinatorial dimension of (P1) is at most $n + 1$. Consider any subset $\mathcal{R}$ of $\mathcal{D}$. A violator of $\mathcal{R}$ is an element $e$ of $\mathcal{D}$ such that $\phi(\mathcal{R} \cup \{e\}) \ne \phi(\mathcal{R})$. An element $e$ of $\mathcal{R}$ is extreme (or, simply called an extremer) if $\phi(\mathcal{R} - \{e\}) \ne \phi(\mathcal{R})$. Consider our case. For any subset $R$ of $X$, let $(w^*, \theta_+^*, \theta_-^*)$ be a local solution of (P1) obtained for $R$. Then $x_i \in X$ is a violator of $R$ (or, more directly, a violator of $(w^*, \theta_+^*, \theta_-^*)$) if the constraint corresponding to $x_i$ is not satisfied with $(w^*, \theta_+^*, \theta_-^*)$.

Now we state our algorithm as Figure 1. In the algorithm, we use $u$ to denote a weight scheme that assigns some integer weight $u(x_i)$ to each $x_i \in X$. For this weight scheme $u$, consider a multiset $U$ containing each example $x_i$ exactly $u(x_i)$ times. Note that $U$ has $u(X)$ ($= \sum_i u(x_i)$) elements. Then by “choose $r$ examples randomly from $X$ according to $u$”, we mean to select a set of $r$ examples randomly from all $r$-element subsets of $U$ with equal probability.

For analyzing the efficiency of this algorithm, we use the Sampling Lemma that is stated as follows. (We omit the proof, which is given in [10].)
procedure OptMargin
    set weight $u(x_i)$ to be 1 for all examples in $X$;
    $r \leftarrow 6\delta^2$;    % $\delta = n + 1$.
    repeat
        $R \leftarrow$ choose $r$ examples from $X$ randomly according to $u$;
        $(w^*, \theta_+^*, \theta_-^*) \leftarrow$ a solution of (P1) for $R$;
        $V \leftarrow$ the set of violators in $X$ of the solution;
        if $u(V) \le u(X)/(3\delta)$ then double the weight $u(x_i)$ for all $x_i \in V$;
    until $V = \emptyset$;
    return the last solution;
end-procedure.

Fig. 1. Randomized SVM Training Algorithm
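The control flow of Figure 1 can be illustrated on a much simpler LP-type problem. The sketch below is not an SVM trainer: it swaps in the problem of finding the smallest interval containing a set of numbers, whose combinatorial dimension is 2 (the minimum and the maximum play the role of the support vectors), so that no QP solver is needed. All names are hypothetical, and `random.sample(..., counts=...)` requires Python 3.9 or later.

```python
import random

def solve_interval(sample):
    """Toy LP-type solver: the smallest interval containing the sample (delta = 2)."""
    return min(sample), max(sample)

def violators(points, interval):
    """Indices of points outside the interval, i.e. points that would change the solution."""
    lo, hi = interval
    return [i for i, p in enumerate(points) if p < lo or p > hi]

def opt_margin_sketch(points, delta=2, seed=0, max_rounds=10_000):
    rng = random.Random(seed)
    weights = [1] * len(points)              # u(x_i) = 1 initially
    r = 6 * delta ** 2                       # sample size r = 6 * delta^2
    for rounds in range(1, max_rounds + 1):
        # Draw r elements without replacement from the multiset U given by the weights.
        chosen = rng.sample(range(len(points)), r, counts=weights)
        solution = solve_interval([points[i] for i in chosen])
        viol = violators(points, solution)
        if not viol:                         # until V is empty
            return solution, rounds
        if 3 * delta * sum(weights[i] for i in viol) <= sum(weights):
            for i in viol:                   # successful round: double the violators
                weights[i] *= 2
    raise RuntimeError("no solution within max_rounds")

data_rng = random.Random(1)
points = data_rng.sample(range(10 ** 6), 1000)   # 1000 distinct values
solution, rounds = opt_margin_sketch(points)
print(solution == (min(points), max(points)), rounds)
```

Replacing `solve_interval` and `violators` by a solver for (P1) on the sampled examples and a check of the margin constraints would give the structure of OptMargin itself. The `max_rounds` guard is an addition for safety and is not part of Figure 1.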



Lemma 1. Let $(\mathcal{D}, \phi)$ be any LP-type problem. Assume some weight scheme $u$ on $\mathcal{D}$ that gives an integer weight to each element of $\mathcal{D}$. Let $u(\mathcal{D})$ denote the total weight. For a given $r$, $0 \le r < u(\mathcal{D})$, we consider the situation where $r$ elements of $\mathcal{D}$ are chosen randomly according to their weights. Let $\mathcal{R}$ denote the set of chosen elements, and let $v_{\mathcal{R}}$ be the weight of violators of $\mathcal{R}$. Then we have the following bound. (Notice that $v_{\mathcal{R}}$ is a random variable. Let $\mathrm{Exp}(v_{\mathcal{R}})$ denote its expectation.)

$$\mathrm{Exp}(v_{\mathcal{R}}) \le \frac{u(\mathcal{D}) - r}{r + 1} \cdot \delta. \qquad (1)$$

Using this lemma, we can prove the following bound. (For this theorem, we state the proof below, though it is again immediate from the explanation in [10].)

Theorem 1. The average number of iterations executed in the OptMargin algorithm is bounded by $6\delta \ln m = O(n \ln m)$. (Recall $|X| = m$ and $\delta \le n + 1$.)

Proof. We say a repeat-iteration is successful if the if-condition holds in the iteration.

We first bound the number of successful iterations. For this, we analyze how the total weight $u(X)$ increases. Consider the execution of any successful iteration. Since $u(V) \le u(X)/(3\delta)$, by doubling the weight of all examples in $V$, i.e., all violators, $u(X)$ increases by at most $u(X)/(3\delta)$. Thus, after $t$ successful iterations, we have $u(X) \le m(1 + 1/(3\delta))^t$. (Note that $u(X)$ is initially $m$.)

Let $X_0 \subseteq X$ be the set of support vectors of (P1). Note that if all elements of $X_0$ are chosen to $R$, i.e., $X_0 \subseteq R$, then there should be no violator for $R$. Thus, at each successful iteration (if it is not the end) some $x_i$ of $X_0$ must not be in $R$, which in turn is a violator of $R$. Hence, $u(x_i)$ gets doubled. Since $|X_0| \le \delta$, there is some $x_i$ in $X_0$ that gets doubled at least once every $\delta$ successful iterations. Therefore, after $t$ successful iterations, $u(x_i) \ge 2^{t/\delta}$.

Therefore, we have the following upper and lower bounds for $u(X)$.

$$2^{t/\delta} \le u(X) \le m(1 + 1/(3\delta))^t.$$

This implies that $t < 3\delta \ln m$ (if the repeat-condition does not hold after $t$ successful iterations). That is, the algorithm terminates within $3\delta \ln m$ successful iterations.
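The step from the two-sided bound $2^{t/\delta} \le u(X) \le m(1 + 1/(3\delta))^t$ to $t < 3\delta \ln m$ is not spelled out in the proof. One way to fill it in, using $1 + x \le e^x$ and assuming $m \ge 2$, is the following.

```latex
\begin{align*}
2^{t/\delta} \le m\Bigl(1 + \frac{1}{3\delta}\Bigr)^{t} \le m\, e^{t/(3\delta)}
  &\;\Longrightarrow\; \frac{t}{\delta}\ln 2 \le \ln m + \frac{t}{3\delta} \\
  &\;\Longrightarrow\; t \le \frac{\delta \ln m}{\ln 2 - 1/3} < 3\delta \ln m ,
\end{align*}
```

The last inequality holds because $\ln 2 - 1/3 \approx 0.36$ is larger than $1/3$.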
Next we estimate how often a successful iteration occurs. Here we use the Sampling Lemma. Consider the execution of any repeat-iteration. Let $u$ be the current weight on $X$, and let $R$ and $V$ be the set chosen at this iteration and the set of violators of $R$. Then this $R$ corresponds to $\mathcal{R}$ in the Sampling Lemma, and we have $u(V) = v_{\mathcal{R}}$. Hence from (1), we can bound the expectation $\mathrm{Exp}(v_{\mathcal{R}})$ of $u(V)$ by $(u(X) - r)\delta/(r + 1)$, which is smaller than $u(X)/(6\delta)$ by our choice of $r$. Thus, the probability that the if-condition is satisfied is at least 1/2. This implies that the expected number of iterations is at most twice as large as the number of successful iterations. Therefore, the algorithm terminates on average within $2 \cdot 3\delta \ln m$ steps.
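The probability bound of 1/2 in this argument can be made explicit. With the sample size $r = 6\delta^2$ used in OptMargin, the Sampling Lemma bounds the expected violator weight, and Markov's inequality (our reading of the omitted step) then bounds the chance that the if-condition fails.

```latex
\begin{align*}
\mathrm{Exp}(u(V)) &\le \frac{(u(X) - r)\,\delta}{r + 1}
  < \frac{u(X)\,\delta}{6\delta^{2}} = \frac{u(X)}{6\delta}, \\
\Pr\Bigl[\,u(V) > \frac{u(X)}{3\delta}\Bigr]
  &\le \frac{\mathrm{Exp}(u(V))}{u(X)/(3\delta)}
  < \frac{u(X)/(6\delta)}{u(X)/(3\delta)} = \frac{1}{2}.
\end{align*}
```

Hence each iteration is successful with probability at least 1/2, independently of the current weights.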

Thus, while our randomized OptMargin algorithm needs to solve (P1) about $6n \ln m$ times on average, the number of constraints that need to be considered each time is about $6n^2$. Hence, if $n$ is much smaller than $m$, then this algorithm is faster than solving (P1) directly. For example, the fastest QP solver to date
needs roughly 0{mn^) time. Hence, if n is smaller than then we can get (at 

least asymptotic) speed-up. (Of course, one does not have to use such a general-purpose solver, but even for an algorithm designed specifically for solving (P1), it is better if the number of constraints is smaller.)

3 A Nonseparable Case and a Geometrical View 

For the separable case, the randomized sampling approach seems to help us by reducing the size of the optimization problem we need to solve for training an SVM. On the other hand, an important feature of SVMs is that they are also applicable to the nonseparable case. More precisely, the nonseparable case includes two subcases: (i) the case where the hyperplane classifier is too weak for classifying the given examples, and (ii) the case where there are some erroneous examples, namely outliers. The first subcase is solved by the SVM approach by mapping examples into a much higher-dimensional space. The second subcase is solved by relaxing constraints by introducing slack variables or “soft margin error”. In this paper, we will discuss a way to handle the second subcase; that is, the nonseparable case with outliers.

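First we generalize the problem (P1) and state the soft margin hyperplane separation problem.

Max Soft Margin (P2)

min. $\frac{1}{2}\|w\|^2 - (\theta_+ - \theta_-) + D \sum_i \xi_i$
w.r.t. $w = (w_1, \ldots, w_n)$, $\theta_+$, $\theta_-$, and $\xi_1, \ldots, \xi_m$,
s.t. $w \cdot x_i \ge \theta_+ - \xi_i$ if $y_i = 1$,
     $w \cdot x_i \le \theta_- + \xi_i$ if $y_i = -1$, and $\xi_i \ge 0$.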



Here $D < 1$ is a parameter that determines the degree of influence from outliers. Note that $D$ should be fixed in advance; that is, $D$ is a constant throughout the training process. (There is a more generalized SVM formulation, where one can change $D$ and furthermore use a different $D$ for each example. We leave such a generalization for our future work.)

At this point, we can formally define the notion of outliers we are considering in this paper. For a given set $X$ of examples, suppose we solve the problem (P2) and obtain the optimal hyperplane. Then an example in $X$ is called an outlier if it is misclassified by this hyperplane. Throughout this paper, we use $\ell$ to denote the number of outliers. Notice that this definition of outlier is quite relative; that is, relative to the hypothesis class and relative to the soft margin parameter $D$.

The problem (P2) is again a quadratic programming problem with linear constraints; thus, it is possible to use our random sampling technique. More specifically, by choosing $\delta$ appropriately, we can use the algorithm OptMargin of Figure 1 here. But while $\delta \le n + m + 1$ is trivial, it does not seem³ trivial to derive a better bound for $\delta$. On the other hand, the bound $\delta \le n + m + 1$ is useless in the algorithm OptMargin because the sample size $6\delta^2$ is much larger than $m$, the number of all examples given. Thus, some new approach seems necessary.

Here we introduce a new algorithm by reformulating the problem (P2) in a different way. We will make use of an intuitive geometric interpretation of (P2) that has been given by Bennett and Bredensteiner [3].

Bennett and Bredensteiner [3] proved that (P2) is equivalent to the following problem (P3); more precisely, (P3) is the Wolfe dual of (P2).



Reduced Convex Hull (P3)

min. $\frac{1}{2}\left\|\sum_i y_i s_i x_i\right\|^2$
w.r.t. $s_1, \ldots, s_m$
s.t. $\sum_{i:\, y_i = 1} s_i = 1$, $\sum_{i:\, y_i = -1} s_i = 1$, and $0 \le s_i \le D$.



Note that $\left\|\sum_i y_i s_i x_i\right\| = \left\|\sum_{i:\, y_i = 1} s_i x_i - \sum_{i:\, y_i = -1} s_i x_i\right\|$. That is, the value minimized in (P3) is the distance between two points in the convex hulls of positive and negative examples. In the separable case, it is the distance between the two closest points in the two convex hulls. On the other hand, in the nonseparable case, we restrict the influence of each example; each example cannot contribute to the closest point more than $D$.

As mentioned in [3], the meaning of $D$ is intuitively explained by considering its inverse $k = 1/D$. (Here we assume that $1/D$ is an integer. Throughout this note, we use $k$ to denote this constant.) Instead of the original convex hulls, we consider the convex hulls of points composed from $k$ examples. Then the resulting convex hulls are reduced ones and they may be separable by some hyperplane; in the extreme case where $k = m_+$ (where $m_+$ is the number of positive examples), the reduced convex hull for positive examples consists of only one point.

$^1$ In the submission version of this paper, we claimed that $\delta \le n + \ell + 1$, thereby deriving
an algorithm by using the algorithm OptMargin. We, however, noticed later that it
is not that trivial. Fortunately, the bound $n + \ell + 1$ is still valid, which we found
quite recently, and we will report this fact in our future paper [2].
More formally, we can reformulate (P3) as follows. Let $Z$ be the set of composed
examples $z_I$ that is defined by $z_I = (x_{i_1} + x_{i_2} + \cdots + x_{i_k})/k$ with
some $k$ distinct elements $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ of $X$ with the same label (i.e.,
$y_{i_1} = y_{i_2} = \cdots = y_{i_k}$). The composed example $z_I$ inherits its label $y_I$
from its members. Throughout this note, we use $I$ for indexing elements of $Z$ and
their labels. The range of $I$ is $\{1, \ldots, M\}$, where $M = |Z|$. Note that $M \le \binom{m}{k}$.
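
To make the definition of $Z$ concrete, the following minimal Python sketch enumerates composed examples by brute force. The function name `composed_examples`, the tuple layout, and the toy data are illustrative assumptions and not part of the paper; the enumeration takes time proportional to the number of $k$-subsets, so it is only meant for tiny inputs.

```python
from itertools import combinations
from math import comb


def composed_examples(points, labels, k):
    """Return a list of (z, y, members) for every k-subset of same-label points.

    z is the average of the k chosen points, y is their common label, and
    members is the tuple of original indices the composed example is built from.
    """
    dim = len(points[0])
    composed = []
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        for members in combinations(idx, k):
            z = tuple(sum(points[i][d] for i in members) / k for d in range(dim))
            composed.append((z, label, members))
    return composed


# Toy data (hypothetical): three positive and two negative points in the plane.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (4.0, 4.0), (6.0, 4.0)]
labels = [1, 1, 1, -1, -1]

Z = composed_examples(points, labels, k=2)
print(len(Z), "composed examples; bound comb(m, k) =", comb(len(points), 2))
```

With `k` equal to the number of positive examples, the positive class collapses to a single composed example (its centroid), which is the extreme case of the reduced convex hull.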

For each $z_I$, we use $Z_I$ to denote the set of original examples from which $z_I$ is
composed. Then (P3) is equivalent to the following problem (P4).



Convex Hull of Composed Examples (P4)

min. $\frac{1}{2} \left\| \sum_{I} y_I s_I z_I \right\|^2$
w.r.t. $s_1, \ldots, s_M$
s.t. $\sum_{I:\, y_I = 1} s_I = 1$, $\sum_{I:\, y_I = -1} s_I = 1$, and $0 \le s_I \le 1$.

Finally we consider the Wolfe primal of this problem. Then we come back to
our favorite formulation!

Max Margin for Composed Examples (P5)
min. $\frac{1}{2}\|w\|^2 - (\eta_+ - \eta_-)$
w.r.t. $w = (w_1, \ldots, w_n)$, $\eta_+$, and $\eta_-$
s.t. $w \cdot z_I \ge \eta_+$ if $y_I = 1$, and
$w \cdot z_I \le \eta_-$ if $y_I = -1$.

Note that the combinatorial dimension of (P5) is $n + 1$, the same as that of
(P1). The difference is that we have now $M = O(m^k)$ constraints, which is quite
large. But this situation is suitable for the sampling technique. Suppose that we
use our algorithm based on the randomized sampling technique (OptMargin of
Figure 1) for solving (P5). Since the combinatorial dimension is the same, we can
use $r = 6(n + 1)^2$ as before. On the other hand, from our analysis, the expected
number of iterations is $O(n \ln M) = O(kn \ln m)$. That is, we need to solve QP
problems with $n + 2$ variables and $O(n^2)$ constraints $O(kn \ln m)$ times.
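
As a rough numerical illustration of these quantities, the sketch below evaluates the sample size $r = 6(n+1)^2$ and the bound $\binom{m}{k}$ on $M$ for one hypothetical setting of $n$, $m$, and $k$, and checks that $\ln M \le k \ln m$. The helper names and the chosen values are assumptions made for illustration only; the hidden constants of the $O(\cdot)$ bounds are not modelled.

```python
from math import comb, log


def sample_size(n):
    """Sample size r = 6 (n + 1)^2 used when the combinatorial dimension is n + 1."""
    return 6 * (n + 1) ** 2


def composed_count_bound(m, k):
    """Upper bound comb(m, k) on the number M of composed examples."""
    return comb(m, k)


# Hypothetical setting: 10 attributes, 1000 examples, k = 1/D = 5.
n, m, k = 10, 1000, 5
r = sample_size(n)
M = composed_count_bound(m, k)

print("r =", r)
print("M <=", M)
print("ln M =", round(log(M), 2), "<= k ln m =", round(k * log(m), 2))
```

The point of the comparison is that the local QP size depends only on $n$, while $M$ enters the iteration count only through its logarithm.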

Unfortunately, however, there is a serious problem. The algorithm needs, at 
least as it is, a large amount of time and space for “book keeping” computation. 
For example, we have to keep and update weights of all M composed examples 
in $Z$, which requires at least $O(M)$ steps and $O(M)$ space. But $M$ is huge.

4 A Modified Random Sampling Algorithm 

As we have seen in the previous section, we cannot simply use the algorithm OptMargin for
(P5). It takes too much time and space to maintain the weight of all composed 
examples and to generate them according to their weights. Here we propose a 
way to get around this problem by giving weight to original examples; this is 
our second algorithm. 
Before stating our algorithm and its analysis, let us first examine solutions
to the problems (P2) and (P5). For a given example set $X$, let $Z$ be the set of
composed examples. Let $(w^*, \theta^*_+, \theta^*_-)$ and $(w^*, \eta^*_+, \eta^*_-)$ be the solutions of (P2)
for $X$ and (P5) for $Z$ respectively. Note that the two solutions share the same $w^*$;
this is because (P2) and (P5) are essentially equivalent problems [3]. Let $X_{\mathrm{err},+}$
and $X_{\mathrm{err},-}$ denote the sets of positive/negative outliers. That is, $x_i$ belongs to
$X_{\mathrm{err},+}$ (resp., $X_{\mathrm{err},-}$) if and only if $y_i = 1$ and $w^* \cdot x_i < \theta^*_+$ (resp., $y_i = -1$
and $w^* \cdot x_i > \theta^*_-$). We use $\ell_+$ and $\ell_-$ to denote the number of positive/negative
outliers. Recall that we are assuming that our constant $k$ is larger than both $\ell_+$
and $\ell_-$. Let $X_{\mathrm{err}} = X_{\mathrm{err},+} \cup X_{\mathrm{err},-}$.

The problem (P5) is regarded as the LP-type problem $(\mathcal{D}, \phi)$, where the
correspondence is the same as (P1) except that $Z$ is used as $\mathcal{D}$ here. Let $Z_0$ be
the basis of $Z$. (In order to simplify our discussion, we assume nondegeneracy
throughout the following discussion.) Note that every element of the basis is
extreme in $Z$. Hence, we call elements of $Z_0$ final extremers. By definition, the
solution of (P5) for $Z$ is defined by the constraints corresponding to these final
extremers.

By analyzing the Karush-Kuhn-Tucker (in short, KKT) condition for (P2),
we can show the following facts. (Though the lemma is stated only for the positive
case, i.e., the case $y_I = 1$, the corresponding properties hold for the negative case
$y_I = -1$.)

Lemma 2. Let $z_I$ be any positive final extremer, i.e., an element of $Z_0$ such
that $y_I = 1$. Then the following properties hold: (a) $w^* \cdot z_I = \eta^*_+$. (b) $X_{\mathrm{err},+} \subseteq Z_I$.
(c) For every $x_i \in Z_I$, if $x_i \notin X_{\mathrm{err},+}$, then we have $w^* \cdot x_i = \theta^*_+$.

Proof. (a) Since $Z_0$ is the set of final extremers, (P5) can be solved only with
the constraints corresponding to elements in $Z_0$. Suppose that $w^* \cdot z_I > \eta^*_+$
(resp., $w^* \cdot z_I < \eta^*_-$) for some positive (resp., negative) $z_I \in Z_0$, including $z_I$
of the lemma. Let $Z'$ be the set of such $z_I$'s of $Z_0$. If $Z'$ indeed contained all
positive examples in $Z_0$, then we could replace $\eta^*_+$ with $\eta^*_+ + \epsilon$ for some $\epsilon > 0$ and
still satisfy all the constraints, which contradicts the optimality of the solution.
Hence, we may assume that $Z_0 - Z'$ still has some positive example. Then it is
well known (see, e.g., E) that a local optimal solution to the problem (P5) with 
the constraints corresponding to elements in $Z_0$ is also locally optimal to the
problem (P5) with the constraints corresponding to only elements in $Z_0 - Z'$.
Furthermore, since (P5) is a convex programming problem, a locally optimal solution is
globally optimal. Thus, the original problem (P5) is solved with the constraints
corresponding to elements in $Z_0 - Z'$. This contradicts our assumption that $Z_0$
is the set of final extremers.

(b) Consider the KKT-point $(w^*, \theta^*_+, \theta^*_-, \xi^*, s^*, u^*)$ of (P2). Then the point
must satisfy the following so-called KKT-condition. (Below we use $i$ to denote
indices of examples, and let $P$ and $N$ respectively denote the sets of indices $i$ of examples
such that $y_i = 1$ and $y_i = -1$. We use $e$ to denote the vector with 1 at every
entry.)

$w^* - \sum_{i \in P} s^*_i x_i + \sum_{i \in N} s^*_i x_i = 0$, $De - s^* - u^* = 0$,
$-1 + \sum_{i \in P} s^*_i = 0$, $-1 + \sum_{i \in N} s^*_i = 0$,
$\forall i \in P\ [\, s^*_i (w^* \cdot x_i - \theta^*_+ + \xi^*_i) = 0 \,]$, $\forall i \in N\ [\, s^*_i (w^* \cdot x_i - \theta^*_- - \xi^*_i) = 0 \,]$,
$u^* \cdot \xi^* = 0$ (which means $(De - s^*) \cdot \xi^* = 0$), and $\xi^*, u^*, s^* \ge 0$.

Note that $(w^*, \theta^*_+, \theta^*_-, \xi^*)$ is an optimal solution of (P2), since (P2) is a convex
minimization problem. From these requirements, we have the following relation.
(Note that the condition $s^* \le De$ below is derived from the requirements
$De - s^* - u^* = 0$ and $u^* \ge 0$.)

$w^* = \sum_{i \in P} s^*_i x_i - \sum_{i \in N} s^*_i x_i$,
$\sum_{i \in P} s^*_i = 1$, $\sum_{i \in N} s^*_i = 1$, and $0 \le s^* \le De$.

In fact, s* is exactly the optimal solution of (P3). 

Here by the equivalence of (P4) and (P5), we see that the final extremers are
exactly the points contributing to the solution of (P4). That is, we have $z_I \in Z_0$
if and only if $s^*_I > 0$, where $s^*_I$ is the $I$th element of the solution of (P4).
Furthermore, it follows from the equivalence between (P3) and (P4) that, for any $i$, we
have

$\frac{1}{k} \sum_{I:\, x_i \in Z_I} s^*_I = s^*_i.$    (2)
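
The relation (2) can be checked by expanding a combination of composed examples back into original examples. The following is a short derivation sketch, assuming only that each $z_I$ is the average of the $k$ examples in $Z_I$ and that every member of $Z_I$ carries the label $y_I$; the intermediate notation is introduced here for illustration.

```latex
\begin{align*}
\sum_{I} y_I s_I z_I
  &= \sum_{I} y_I s_I \cdot \frac{1}{k} \sum_{i:\, x_i \in Z_I} x_i
   = \sum_{i} y_i \Bigl( \frac{1}{k} \sum_{I:\, x_i \in Z_I} s_I \Bigr) x_i , \\
\sum_{i:\, y_i = 1} \frac{1}{k} \sum_{I:\, x_i \in Z_I} s_I
  &= \frac{1}{k} \sum_{I:\, y_I = 1} k \, s_I = 1 ,
\qquad
0 \le \frac{1}{k} \sum_{I:\, x_i \in Z_I} s_I \le \frac{1}{k} = D .
\end{align*}
```

So any feasible point of (P4) induces a feasible point of (P3) with the same objective value, with the coefficient of $x_i$ given by the left-hand side of (2).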

Recall that each $z_I$ is defined as the center of $k$ examples of $X$. Hence, to show
that every $x_i \in X_{\mathrm{err},+}$ appears in all positive final extremers, it suffices to show
that $s^*_i = 1/k$ for every $x_i \in X_{\mathrm{err},+}$, which follows from the following argument.
For any $x_i \in X_{\mathrm{err},+}$, since $\xi^*_i > 0$, it follows from the requirements $(De - s^*) \cdot \xi^* = 0$
and $De - s^* \ge 0$ that $D - s^*_i = 0$; that is, $s^*_i = 1/k$ for any $x_i \in X_{\mathrm{err},+}$.

(c) Consider any index $i$ in $P$ such that $x_i$ appears in some final extremer
$z_I \in Z_0$. Since $s^*_I > 0$, we can show that $s^*_i > 0$ by using equation (2). Hence,
from the requirement $s^*_i (w^* \cdot x_i - \theta^*_+ + \xi^*_i) = 0$, we have

$w^* \cdot x_i - \theta^*_+ + \xi^*_i = 0.$

Thus, if $x_i \notin X_{\mathrm{err},+}$, i.e., it is not an outlier, or equivalently $\xi^*_i = 0$, then we have $w^* \cdot x_i = \theta^*_+$.

Let us give some intuitive interpretation to the facts given in this lemma.
(Again we only consider, for simplicity, the positive examples.) First note
that the fact (b) of the lemma shows that all final extremers share the set $X_{\mathrm{err},+}$
of outliers. Next it follows from the fact (a) that all final extremers are located
on some hyperplane whose distance from the base hyperplane $w^* \cdot z = 0$ is $\eta^*_+$.
On the other hand, the fact (c) states that all original normal examples in a final
extremer $z_I$ (i.e., examples not in $X_{\mathrm{err},+}$) are located again on some hyperplane
whose distance from the base hyperplane is $\theta^*_+$. Here consider the point
$v_+ = \sum_{x_i \in X_{\mathrm{err},+}} x_i / \ell_+$, the center of positive outliers, and the value
$w^* \cdot v_+$. Then we have $\theta^*_+ \ge \eta^*_+ \ge w^* \cdot v_+$; that is, the hyperplane defined by
the final extremers is located between the one having all normal examples in the
final extremers and the one having the center $v_+$ of outliers. More specifically,
since every final extremer is composed from all $\ell_+$ positive outliers and $k - \ell_+$
normal examples, we have

$(\eta^*_+ - w^* \cdot v_+) : (\theta^*_+ - \eta^*_+) = (k - \ell_+) : \ell_+.$
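
The ratio can be verified directly from Lemma 2. The derivation below is a sketch that assumes $\ell_+ \ge 1$ and writes $\gamma = w^* \cdot v_+$ as a shorthand introduced only for this calculation.

```latex
\begin{align*}
\eta^*_+ = w^* \cdot z_I
  &= \frac{1}{k} \Bigl( \sum_{x_i \in X_{\mathrm{err},+}} w^* \cdot x_i
     + \sum_{x_i \in Z_I \setminus X_{\mathrm{err},+}} w^* \cdot x_i \Bigr)
   = \frac{\ell_+ \, \gamma + (k - \ell_+) \, \theta^*_+}{k} , \\
\eta^*_+ - \gamma &= \frac{k - \ell_+}{k} \, (\theta^*_+ - \gamma) ,
\qquad
\theta^*_+ - \eta^*_+ = \frac{\ell_+}{k} \, (\theta^*_+ - \gamma) .
\end{align*}
```

Dividing the last two identities gives the stated proportion $(k - \ell_+) : \ell_+$.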

Next we consider local solutions of (P5). We would like to solve (P5) by
using the random sampling technique. That is, choose some small subset $R$ of $Z$
randomly according to the current weights, and solve (P5) for $R$. Thus, let us examine
local solutions obtained by solving (P5) for such a subset $R$ of $Z$.

For any set $R$ of composed examples in $Z$, let $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$ be the solution of
(P5) for $R$. Similar to the above, we consider $\tilde{Z}_0$ to be the set of extremers of $R$
w.r.t. the solution $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$. On the other hand, we define here $\tilde{X}$ to be the
set of original examples appearing in some extremers in $\tilde{Z}_0$.

As before, we will discuss only positive composed/original examples.
Let $\tilde{Z}_{0,+}$ be the set of positive extremers. Different from the case where all composed
examples are examined to solve (P5), here we cannot expect, for example,
that all extremers in $\tilde{Z}_{0,+}$ share the same set of misclassified examples. Thus,
instead of sets like $X_{\mathrm{err},+}$, we consider a subset $X'_+$ of the following set $\tilde{X}_+$. (It
may be the case that $\tilde{X}_+$ is empty.)

$\tilde{X}_+$ = the set of positive examples appearing in all extremers in $\tilde{Z}_{0,+}$.

Intuitively, we want to discuss by using the set $X'_+$ of “misclassified” examples
appearing in all positive extremers. But such a set cannot be defined at this
point because no threshold corresponding to $\theta^*_+$ has been given. Thus, for a
while, let us consider any subset $X'_+$ of $\tilde{X}_+$. Let $\ell'_+ = |X'_+|$, and define
$v'_+ = \sum_{x_i \in X'_+} x_i / \ell'_+$. For each $z_I \in \tilde{Z}_{0,+}$, we define a point $v_I$ that is the center
of all original examples in $Z_I - X'_+$. That is,

$v_I \stackrel{\mathrm{def}}{=} \frac{\sum_{x_i \in Z_I - X'_+} x_i}{k - \ell'_+}.$

Then we can prove the following fact that corresponds to Lemma 2 and that is
proved similarly.

Lemma 3. For any subset $R$ of $Z$, we use the symbols defined as above. There
exists some $\theta'_+$ such that for any extremer $z_I$ in $\tilde{Z}_{0,+}$, we have $\tilde{w} \cdot v_I = \theta'_+$.

Now for our $\tilde{X}_{\mathrm{err},+}$, we use a subset $X'_+$ of $\tilde{X}_+$ defined by $X'_+ = \{ x_i \in
\tilde{X}_+ : \tilde{w} \cdot x_i < \theta'_+ \}$, where $\theta'_+$, which we denote $\tilde{\theta}_{\mathrm{err},+}$, is the threshold given in
Lemma 3 for $X'_+$. Such a set (while it could be empty) is well-defined. (In the
case that $\tilde{X}_{\mathrm{err},+}$ is empty, we define $\tilde{\theta}_{\mathrm{err},+} = \tilde{\eta}_+$.)

For any original positive example $x_i \in X$, we call it a missed example (w.r.t.
the local solution $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$) if $x_i \notin \tilde{X}_{\mathrm{err},+}$ and it holds that

$\tilde{w} \cdot x_i < \tilde{\theta}_{\mathrm{err},+}.$    (3)

We will use such a missed example as evidence that there exists a “violator”
to $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$, which is guaranteed by the following lemma.

Lemma 4. For any subset $R$ of $Z$, let $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$ be the solution of (P5) for
$R$. Then if there exists a missed example w.r.t. $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$, then we have some
composed example in $Z$ that is misclassified w.r.t. $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$. On the other
hand, for any composed example $z_I \in Z$, if it is misclassified w.r.t. $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$,
then $Z_I$ contains some missed example.

Proof. We consider again only the positive case. Suppose that some missed positive
example $x_i$ exists. By definition, we have $\tilde{w} \cdot x_i < \tilde{\theta}_{\mathrm{err},+}$, and there exists
some extremer $z_I \in \tilde{Z}_{0,+}$ that does not contain $x_i$. Clearly, $z_I$ contains some
example $x_j$ such that $\tilde{w} \cdot x_j \ge \tilde{\theta}_{\mathrm{err},+}$. Then we can see that a composed element
$z_J$ consisting of $Z_I - \{x_j\} \cup \{x_i\}$ does not satisfy the constraint $\tilde{w} \cdot z_J \ge \tilde{\eta}_+$.

For proving the second statement, note first that any “misclassified” original
example $x_i$, i.e., an example for which the inequality (3) holds, is either a missed
example or an element of $\tilde{X}_{\mathrm{err},+}$. Thus, if a composed element $z_I$ does not contain
any missed example, then it cannot contain any misclassified examples other
than those in $\tilde{X}_{\mathrm{err},+}$. Then it is easy to see that $\tilde{w} \cdot z_I \ge \tilde{\eta}_+$; that is, $z_I$ is not
misclassified w.r.t. $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$.

We explain the idea of our new random sampling algorithm. As before, we
choose (according to some weight) a set $R$ consisting of $r$ composed examples in
$Z$, and then solve (P5) for $R$. In the original sampling algorithm, this sampling
is repeated until no violator exists. Recall that we are regarding (P5) as an
LP-type problem and that by “a violator of $R$”, we mean a composed example
that is misclassified with the current solution $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$ of (P5) obtained for
$R$. Thanks to the above lemma, we do not have to go through all composed
examples in order to search for a violator. A violator exists if and only if there
exists some missed example w.r.t. $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-)$. Thus, our first idea is to use the
existence of a missed example for the stopping condition. That is, the sampling
procedure is repeated until no missed example exists.

The second idea is to use the weight of examples $x_i$ in $X$ to define the weight
of composed examples. Let $u_i$ denote the weight of the $i$th example $x_i$. Then for
each composed example $z_I \in Z$, its weight $U_I$ is defined as the total of weights
of all examples contained in $Z_I$; that is, $U_I = \sum_{x_i \in Z_I} u_i$. We use symbols $u$ and
$U$ to refer to these two weight schemes; we sometimes use these symbols to denote a
procedure OptMarginComposed
    $u_i \leftarrow 1$, for each $i$, $1 \le i \le m$;
    $r \leftarrow 6\alpha\beta n$;    % For $\alpha$ and $\beta$, see the explanation in the text.
    repeat
        $R \leftarrow$ choose $r$ elements from $Z$ randomly according to their weights;
        $(\tilde{w}, \tilde{\eta}_+, \tilde{\eta}_-) \leftarrow$ the solution of (P5) for $R$;
        $X_{\mathrm{err}} \leftarrow$ the set of missed examples w.r.t. the above solution;
        if $u(X_{\mathrm{err}}) \le u(X)/(3\beta)$ then $u_i \leftarrow 2u_i$ for each $x_i \in X_{\mathrm{err}}$;
    until no missed example exists;
    return the last solution;
end-procedure.

Fig. 2. A Modified Random Sampling Algorithm



mapping from a set of (composed) examples to its total weight. For example,
$u(X) = \sum_{i} u_i$ and $U(Z) = \sum_{I} U_I$. As explained below, it is computationally
easy to generate each $z_I$ with probability $U_I / U(Z)$.

Our third idea is to increase the weight $u_i$ if $x_i$ is a missed example w.r.t. the
current solution. More specifically, we double the weight $u_i$ if $x_i$ is a missed
example w.r.t. the current solution for $R$. Lemma 4 guarantees that the weight
of some element of a final extremer gets doubled so long as there is some missed
example. This property is crucial to estimate the number of iterations.
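
The weight update can be sketched in a few lines of Python. This is only an illustration of the doubling rule and of the if-condition of Fig. 2; the function name `update_weights`, the dictionary representation of the weights, and the example values are assumptions, and the set of missed examples is taken as given rather than computed from a solution of (P5).

```python
def update_weights(u, missed, beta):
    """Double u[i] for every missed example i if the iteration is successful.

    u      -- dict mapping example index -> current weight u_i
    missed -- set of indices of missed examples for the current local solution
    beta   -- the parameter beta of the algorithm
    Returns True when the if-condition u(X_err) <= u(X) / (3 * beta) holds.
    """
    total = sum(u.values())
    missed_weight = sum(u[i] for i in missed)
    # Same test as u(X_err) <= u(X) / (3 * beta), written without division.
    successful = 3 * beta * missed_weight <= total
    if successful:
        for i in missed:
            u[i] *= 2
    return successful


# Hypothetical run: 12 examples with unit weights and beta = 2.
u = {i: 1 for i in range(12)}
first = update_weights(u, missed={0}, beta=2)         # 6 <= 12: weights doubled
second = update_weights(u, missed={0, 1, 2}, beta=2)  # 24 > 13: weights unchanged
print(first, second, u[0], u[1])
```

Skipping the update when the missed examples carry too much weight is what keeps the total weight $u(X)$ growing slowly over successful iterations.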

Now we state our new algorithm in Figure 2. In the following, we explain 
some important points on this algorithm. 

Random Generation of R 

We explain how to generate each $z_I$ with probability proportional to $U_I$. Again we only consider
the generation of positive composed examples, and we assume that all positive
examples are re-indexed as $x_1, \ldots, x_m$. Also for simplifying our notation, we reuse
$m$ and $M$ to denote $m_+$ and $M_+$ respectively.

Recall that each $z_I$ is defined as $(x_{i_1} + \cdots + x_{i_k})/k$, where $x_{i_j}$ is an element of
$Z_I$. Here we assume that $i_k < i_{k-1} < \cdots < i_1$. Then each $z_I$ uniquely corresponds
to some $k$-tuple $(i_k, \ldots, i_1)$, and we identify here the index $I$ of $z_I$ and this $k$-tuple.
Let $\mathcal{I}$ be the set of all such $k$-tuples $(i_k, \ldots, i_1)$ that satisfy $1 \le i_j \le m$ (for
each $j$, $1 \le j \le k$) and $i_k < \cdots < i_1$. Here we assume the standard lexicographic
order in $\mathcal{I}$.

As stated in the above algorithm, we keep the weights $u_1, \ldots, u_m$ of examples
in $X$. By using these weights, we can calculate the total weight $U(Z) = \binom{m-1}{k-1} \sum_{i} u_i$.
Similarly, for each $z_I \in Z$, we consider the following accumulated weight $U(I)$.

$U(I) \stackrel{\mathrm{def}}{=} \sum_{J \le I} U_J.$

As explained below, it is easy to compute this weight for given $z_I \in Z$. Thus,
for generating $z_I$, (i) choose $p$ randomly from $\{1, \ldots, U(Z)\}$, and (ii) search for
the smallest element $I$ of $\mathcal{I}$ such that $U(I) \ge p$. The second step can be done by
the standard binary search in $\{1, \ldots, M\}$, which needs $\log M$ ($\le k \log m$) steps.
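
For comparison, the same target distribution can also be sampled without ranking tuples at all. The sketch below is an alternative technique, not the accumulated-weight binary search described in the text: it picks one member with probability proportional to its weight and the remaining $k - 1$ members uniformly from the rest, so a subset $S$ is returned with probability $\sum_{i \in S} u_i / \bigl(\binom{m-1}{k-1} \sum_i u_i\bigr) = U_S / U(Z)$. The function names and the toy weights are assumptions for illustration.

```python
import random
from itertools import combinations
from math import comb


def sample_composed(u, k, rng=random):
    """Draw a k-subset S of range(len(u)) with probability U_S / U(Z)."""
    m = len(u)
    # Step 1: pick one member with probability u_i / sum(u).
    first = rng.choices(range(m), weights=u, k=1)[0]
    # Step 2: pick the other k - 1 members uniformly from the remaining indices.
    rest = [i for i in range(m) if i != first]
    others = rng.sample(rest, k - 1)
    return tuple(sorted([first] + others))


def exact_probability(subset, u, k):
    """Closed-form probability that sample_composed returns `subset`."""
    m = len(u)
    return sum(u[i] for i in subset) / (sum(u) * comb(m - 1, k - 1))


# Hypothetical weights for four positive examples, composed in pairs.
u = [1, 2, 4, 1]
for s in combinations(range(len(u)), 2):
    print(s, exact_probability(s, u, 2))
print("one draw:", sample_composed(u, 2, random.Random(0)))
```

Each draw costs $O(m)$ time here; the ranking-based method of the text instead pays about $\log M$ steps of binary search per draw, each step requiring an accumulated weight.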
We explain a way to compute $U(I)$. First we prepare some notations. Define
$V(I) = \sum_{J \ge I} U_J$. Then it is easy to see that (i) $U_0 = V((1, 2, 3, \ldots, k))$ (the total weight), and (ii)
$U(I) = U_0 - V(I) + (u_{i_k} + u_{i_{k-1}} + \cdots + u_{i_1})$ for each $I = (i_k, i_{k-1}, \ldots, i_1)$. Thus,
it suffices to show how to compute $V(I)$.

Consider any given $I = (i_k, i_{k-1}, \ldots, i_1)$ in $\mathcal{I}$. Also for any $j$, $1 \le j \le k$, we
consider the prefix $I_j = (i_j, i_{j-1}, \ldots, i_1)$ of $I$, and define the following values.

$N_j$ = # of $I'_j$ such that $I'_j \ge I_j$, and
$V_j = \sum_{I'_j \ge I_j} (u_{i'_j} + u_{i'_{j-1}} + \cdots + u_{i'_1})$.

Then clearly we have $V(I) = V_k$, and our task is to compute $V_k$, which can be
done inductively as shown in the following lemma. (The proof is omitted.)

Lemma 5. Consider any given $I = (i_k, i_{k-1}, \ldots, i_1) \in \mathcal{I}$, and use the symbols
defined above. Then for each $j$, $1 \le j \le k$, the following relations hold.

N, = and V, = Y. 

^ d / ij + l<i<m ^ 

Stopping Condition and Number of Successful Iterations 

The correctness of our stopping condition is clear from Lemma 4. We estimate
the number of repeat-iterations. Here again we say that a repeat-iteration
is successful if the if-condition holds. We give an upper bound for the number of
successful iterations.

Lemma 6. Set $\beta = k(n + 1)$ in the algorithm. Then the number of successful
iterations is at most $3k(n + 1) \ln m$.

Proof. For any $t > 0$, we consider the total weight $u(X)$ after $t$ successful iterations.
As before, we can give an upper bound $u(X) \le m(1 + 1/(3\beta))^t$. On
the other hand, some missed example exists at each repeat-iteration, and from
Lemma 4 we can indeed find it in any violator, in particular, in some final extremer
$z_I \in Z_0$. Thus, there must be some element $x_i$ of $\bigcup_{z_I \in Z_0} Z_I$ whose weight $u_i$ gets
doubled at least once every $k(n + 1)$ steps. (Recall that $|Z_0| \le n + 1$.) Hence, we
have

$2^{t/(k(n+1))} \le u(X) \le m(1 + 1/(3\beta))^t.$

This implies, under the above choice of $\beta$, that $t < 3k(n + 1) \ln m$.
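
The last step can be spelled out by taking logarithms. The following worked calculation is a sketch using only the inequality $\ln(1 + x) \le x$, the choice $\beta = k(n + 1)$, and $m \ge 2$.

```latex
\begin{align*}
\frac{t}{\beta} \ln 2
  &\le \ln m + t \ln\Bigl( 1 + \frac{1}{3\beta} \Bigr)
   \le \ln m + \frac{t}{3\beta} , \\
t &\le \frac{\beta \ln m}{\ln 2 - 1/3}
   \approx 2.78 \, \beta \ln m
   < 3k(n + 1) \ln m .
\end{align*}
```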

Our Hypothesis and the Sampling Lemma 

Finally, the most important point is to estimate how often we would have successful
iterations. At each repeat-iteration of our algorithm, we consider the ratio
$\rho_{\mathrm{miss}} = u(X_{\mathrm{err}})/u(X)$. Recall that the repeat-iteration is successful if this ratio
is at most $1/(3\beta)$. Our hypothesis is that the ratio $\rho_{\mathrm{miss}}$ is, on average, bounded
by $1/(3\beta)$. Here we discuss when and for which parameter $\beta$ this hypothesis
would hold.

For the analysis, we consider “violators” to the local solution of (P5) obtained
at each repeat-iteration. Let $R$ be the set of $r'$ composed examples randomly
chosen from $Z$ with probability proportional to their weights determined by
$U$. Recall that a violator of $R$ is a composed example $z_I \in Z$ that is misclassified
with the obtained solution for $R$. Let $V$ be the set of violators, and let $v_R$ be
its weight under $U$. Recall also that the total weight of $Z$ is $U(Z)$. Thus, by the
Sampling Lemma, the ratio $\rho_{\mathrm{vio}} = v_R / U(Z)$ is bounded as follows.



$\mathrm{Exp}(\rho_{\mathrm{vio}}) \le \frac{(n + 1)(U(Z) - r')}{r' + 1} \cdot \frac{1}{U(Z)} < \frac{n + 1}{r'}.$



From Lemma 4 we know that every violator should contain at least one
missed example. On the other hand, every missed example would contribute
to some violator. Hence, it seems reasonable to expect that the ratio $\rho_{\mathrm{miss}}$ is
bounded by $\alpha \cdot \rho_{\mathrm{vio}}$ for some constant $\alpha \ge 1$, or at least that this holds quite often if it
is not always true. (It is still o.k. even if $\alpha$ is a low degree polynomial w.r.t. $n$.)
Here we propose the following technical hypothesis.

(Hypothesis) $\rho_{\mathrm{miss}} \le \alpha \cdot \rho_{\mathrm{vio}}$, for some $\alpha \ge 1$.

Under this hypothesis, we have $\rho_{\mathrm{miss}} \le \alpha(n + 1)/r'$ on average; thus, by taking $r' =
6\alpha\beta n$, we can show that the expected ratio $\rho_{\mathrm{miss}}$ is at most $1/(6\beta)$, which implies
as before that the expected number of iterations is at most twice the number
of successful iterations. Therefore the average number of iterations is bounded
by $6k(n + 1) \ln m$.



5 Concluding Remarks: Finding Outliers 

In computational learning theory, one of the recent important topics is to develop
an effective method for handling data with inherent errors. Here by an “inherent
error”, we mean an error or noise that cannot be corrected by resampling.
A typical case is an example that is mislabeled, where the mislabeling does not
change even if we resample the example. Many learning algorithms
fail to work in the presence of such inherent errors. SVMs are more robust
against errors, but determining the parameters for erroneous examples is still
something of an art. More specifically, the complexity of classifiers and the degree $D$
of the influence of errors are usually selected based on experts’ knowledge
and experience.

Let us fix a hypothesis class as the set of hyperplanes of the sample domain. 
Also suppose, for the time being, that the parameter D is somehow appropriately 
chosen. Then we can formally define erroneous examples — outliers — as we did 
in this paper. Clearly, outliers can be identified by solving (P2); by using the 
obtained hyperplane, we can check whether a given example is an outlier or not. 
But it would be nice if we could find outliers in the course of our computation.
As we discussed in the previous section, outliers are not only misclassified examples but
also misclassified examples that commonly appear in support vector composed 
examples. Thus, if there is a good iterative way to solve (P5), we may be able 
to identify outliers by checking for commonly appearing misclassified examples 
in support vector composed examples of each local solution. We think that our 
second algorithm can be used for this purpose. 

Also a randomized sampling algorithm for solving (P5) can be used to de- 
termine the parameter D = 1/k. Note that if we use k that is not large enough, 
then (P5) does not have a solution; there is no hyperplane separating composed 
examples. In this case, we would have more violators than we expect. Thus, by 
running a randomized sampling algorithm for (P5) several rounds, we can de- 
tect that the current choice of k is too small if an unsuccessful iteration (i.e., an 
iteration where the if-condition fails) occurs frequently. Thus, we can revise k at 
an earlier stage. 
Abstract. This paper develops a theory for learning scenarios where
multiple learners co-exist but there are mutual coherency constraints on 
their outcomes. This is natural in cognitive learning situations, where 
“natural” constraints are imposed on the outcomes of classifiers so that 
a valid sentence, image or any other domain representation is produced. 
We formalize these learning situations, after a model suggested in 
and study generalization abilities of learning algorithms under these conditions
in several frameworks. We show that the mere existence of coherency
constraints, even without the learner’s awareness of them, makes
the learning problem easier than predicted by general theories and explains
the ability to generalize well from a fairly small number of examples.
In particular, it is shown that within this model one can develop
an understanding of several realistic learning situations such as highly
biased training sets and low dimensional data that is embedded in high 
dimensional instance spaces. 



1 Introduction 

A fundamental research effort in learning theory has been the study of generaliza- 
tion abilities of learning algorithms and their dependence on sample complexity. 
The importance of this research direction goes beyond intellectual curiosity. Un- 
derstanding the inherent difficulty of learning problems allows one to evaluate 
whether learning is at all possible in certain situations, estimate the degree of 
confidence in the predictions made by learned classifiers and is crucial in un- 
derstanding and analyzing learning algorithms. In particular, these theoretical 
considerations played a crucial role in the development of practical learning 
approaches [7,9,6].

One puzzling problem from a theoretical and a practical point of view is 
the contrast between the hardness of learning problems, as suggested by various 
bounds on sample complexity and generalization - even for fairly simple concepts 
- and the apparent ease with which cognitive systems seem to learn those concepts.
Cognitive systems seem to use far fewer examples and learn more robustly
than is predicted by the theoretical models developed so far.

This work develops a learning theory that explains this phenomenon. Fol- 
lowing m our approach is based on the observation that cognitive learning 

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 135-150, 2001.
problems do not usually occur in isolation. Rather, the input is observed by
multiple learners that may learn different functions on the same input. We pur- 
sue this direction by developing a theory for learning scenarios where multiple 
learners co-exist but there are mutual compatibility constraints on their out- 
comes. We believe that this is natural in cognitive learning situations, where 
“natural” compatibility constraints are imposed on the outcomes of classifiers 
so that a valid sentence, image or other domain representation is produced. In 
particular, this model can be viewed as a theoretical framework for learning in 
multi-modal situations. 

Assume, for example, that one is trying to learn a function that deter- 
mines, given a sentence which contains one of {weather, whether} which of the 
two should actually occur in the sentence. E.g., given the sentence I did not 
know weather to laugh or cry determine if weather should be replaced by 
whether. The function learned to perform this task may be fairly complicated; 
it could depend on a huge number of features such as words neighboring the 
target words in sentences, their syntactic tags, etc. [9]. Notice, however, that 
the same sentence could be supplied as input to a different function that pre- 
dicts the part-of-speech (pos) of the word weather (and others) in this sentence. 
However, the predictions of these functions are not independent. For example, if 
the pos function determines, in a given context, that the target word is a noun, 
then the spelling function cannot determine that the correct spelling is whether. 
Other more intricate constraints exist with other functions that can receive this 
sentence as input. Consequently, perhaps, even though the data for problems of 
these sort typically reside in very high dimensional space (e.g., 10^ to 10®), one 
is able to achieve good classification performance (on test data) by looking at 
relatively few training examples; very few relative to what is expected by theory 
and is needed in simulations of synthetic data of this dimensionality. Similar phe- 
nomena exist when learning to detect faces or properties of faces (e.g., gender) 
in visual learning problems. 

This exemplifies our notion of coherency constraints: given that these two 
functions need to produce coherent outputs, the input sentence may not take any 
possible value in the input space (that it could have taken when the function’s 
learnability is studied in isolation) but rather may be restricted to a subset of 
the inputs on which the functions’ outcomes are coherent. In this paper we model
these learning situations and develop a learning theory that attempts to explain 
these phenomena. 

Notations: We consider the standard scenario of concept learning from examples.
A learner is trying to identify a binary classifier $c : X \to \{0, 1\}$ when
presented with examples $(x, y)$, where instances $x \in X$ ($= \mathbb{R}^n$) are drawn according
to a fixed (but unknown) distribution on $X$ and labeled $y = c(x)$. $m$ denotes
the number of training examples. $\mathcal{H}$ denotes the hypothesis space (the class of
functions from which a hypothesis is selected), $|\mathcal{H}|$ is its cardinality and $h \in \mathcal{H}$
refers to the learned hypothesis.

While our goal is to learn a single target concept, $c : \mathbb{R}^n \to \{0, 1\}$, we are
interested in studying situations in which the learning scenario involves several
concepts $c_1, c_2, \ldots, c_k$. We further assume that the concepts are subject to a
constraint $g$ ($g : X \times \{0, 1\}^k \to \{0, 1\}$) which is fixed (but could be probabilistic)
and unknown to the learner. The constraint reflects the fact that all these
functions represent different aspects of some natural data, as in the example
above. We formalize this learning scenario and show that the mere existence of
the other functions along with the constraints Nature imposes on the relations
between these functions - all unknown to the learner - contributes to the effective
simplification of the task of learning $c_1$.

The effect of constraints on the learning is analyzed by studying three models, 
with increased generality. We start with a PAC analysis of the finite hypothesis
class case, under coherency constraints. We then relax some of the assumptions 
and move to study constraints in the more general equivalence class framework. 
This allows us to develop a view of coherency as an equivalence relation on the 
hypothesis class. This view is beneficial in understanding conditions under which 
learning becomes easier and supports better generalization, even when the same 
hypothesis class is used. Finally, we develop a general VC-dimension view of 
coherency constraints. We show that these can be analyzed as a way to restrict 
the effective number of dichotomies and thus VC-dimension techniques can be
used to derive generalization bounds. 

We also provide some examples that serve to motivate the framework and 
exemplify its power as well as some experimental evidence to its validity. In 
particular, we show that within this framework one can study and develop an
understanding of several realistic learning situations such as highly biased training
sets and low dimensional data embedded in high dimensional instance spaces.

2 Coherency Constraints 

The usual way to constrain the learning task is to explicitly restrict the concept 
class. Instead, here we are concerned with the case in which the restriction 
is imposed implicitly via interaction among concepts. More precisely, we are 
interested in learning the concept $c_1$ in a situation that involves several concepts
$c_1, c_2, \ldots, c_k$; $c_i : X \to \{0, 1\}$, and a global constraint $g : X \times \{0, 1\}^k \to \{0, 1\}$
on the outcomes of these concepts.

The concept of coherency constraints has been formalized in m where it was 
shown that it can be used to explain the “easiness” and robustness of learning in 
some restricted situations. Several semantics for coherency were discussed there. 
The notion of Class Coherency was developed to indicate coherency at the level 
of the outcome of the classifiers; this turns out to be too restrictive in that it 
restricted the hypothesis class to include only functions which are coherent with 
each other over all samples. This notion was then relaxed to define Distributional 
Coherency. In this case the hypothesis space is not restricted; rather, the effect 
of distributional coherency is in disallowing some of the instances - those on 
which the constraints are not satisfied - to occur in the input. Results are given 
for the case of mistake bound learning of half spaces under specific constraints. 
It was also shown that learning concepts under this model results in hypotheses
that are more robust to attribute noise. Similar ideas on robustness have also
been discussed in [2,8].

The model studied in this paper builds on the distributional coherency model 
but extends it in several directions. First, we generalize it to general constraints; 
we extend it to a probabilistic setting and allow constraints to apply only with 
some probability; and, under these conditions, we develop techniques to analyze
general classifiers.

Although we are interested in learning a single function $c_1$, in the following
definition we will consider it along with the (possibly) constraining functions
$c_2, \ldots, c_k$, and denote $c = (c_1, c_2, \ldots, c_k)$. Thus the constraints in the following
definition are imposed on the direct product $C^k$. The semantics is that for each
$c \in C^k$, we restrict the domain of $c$ to $X'$ where, with high probability, the
constraint is satisfied; that is, $\forall x \in X'$, $g(c(x)) = 1$. Moreover, we allow $g$ to
depend on $x$ (denoted $g_x$), so that the constraints can take a very general form.

Definition 1 (Distributional Coherency). Let $C$ be a class of functions $c :
X \to \{0,1\}$, $g : X \times \{0,1\}^k \to \{0,1\}$ a Boolean constraint, and $\alpha \in [0,1]$ a
constant. We define the class of $g$-coherent functions $C^*$ to be the collection of
all functions $c^* : X \to \{0,1\}^k \cup \{*\}$ in $C^k$ defined by

$$c^*(x) = \begin{cases} c(x) & \text{if } g_x(c(x)) = 1 \\ * & \text{otherwise} \end{cases}$$

The value $*$ is interpreted as a forbidden value for the function $c$. Thus we
restrict the domain of $c$ to the subset $X'$ of $X$ satisfying the constraint $g$.

The above definition restricts the set of functions we study. Equivalently,
we can say that all functions can be studied, but the distribution of the data
observed is restricted to respect the coherency constraints. This is made explicit
in the following definition for coherency, the one used in the rest of the paper.

Definition 2. The functions $c = (c_1, c_2, \ldots, c_k)$ are $\alpha$-coherent if

$$P\{x \mid g_x(c_1, c_2, \ldots, c_k)\} = \alpha \qquad (1)$$

where $P$ is the probability according to which instances in $X$ are drawn.

In the pac learning model the above constraint can be interpreted as restricting
the class of distributions when learning a function $c_1 \in C$. Only distributions
giving zero (or small, depending on $\alpha$) weight to the region $X \setminus X'$ are allowed.

We note that this is different from the model of distribution specific learning
(e.g., [5]). There, the learner is explicitly aware of the underlying distribution and
can utilize it directly. Our model is based on assuming that this is unrealistic;
instead, we assume a distribution free model in which the distribution could be
constrained in intricate ways. The learner is unaware of this. However, as we
show, under this model the learning problem nevertheless becomes easier and
we can justify the generalization abilities of the learned hypothesis even in the
presence of a relatively small number of training examples.

3 PAC Analysis of Coherency Constraints



Consider the pac model [12] analysis for the admissible case when the hypothesis
space is finite. That is, it is assumed that during the training phase one is
provided with $m$ training examples drawn independently according to $P$ and
labeled according to some target concept $c_1$. The learning algorithm chooses a
hypothesis $h \in \mathcal{H}$ that is consistent with the target function on the training data. In
this setting [5] we know that for the true error of $h$ (that is, $\Pr_P(h(x) \neq c_1(x))$)
to be bounded by $\epsilon$ with probability at least $1 - \delta$, the number of training
examples required needs to be greater than

$$m > \frac{\ln|\mathcal{H}| + \ln\frac{1}{\delta}}{\epsilon} \qquad (2)$$

This analysis can be extended to the non admissible case (when the target function
is not in $\mathcal{H}$) and the assumption on $|\mathcal{H}|$ being finite can be relaxed to a
finite VC dimension. Eqn. (2) gives the relation between the true error and the
sample complexity. On fixing the confidence ($\delta$) and the hypothesis class ($\mathcal{H}$),
the sample complexity is inversely proportional to the true error.

We prove next an analogous result that exhibits the effect of the coherency
constraints. W.l.o.g. we present it for the case $k = 2$. As above, our goal is to
learn a hypothesis $h$ that approximates $c_1$; the hypothesis is chosen such that it
is consistent on $m$ training examples with the target concept $c_1$ and thus, based
on the above, (with confidence $\delta$, which we fix for the rest of the discussion) it
has a true error of $\epsilon_1$. We now analyze the effect the presence of the coherency
constraint has on $h$'s performance. Before we do that, and in order to simplify
the discussion that follows, we note that it is always possible to think about the
coherency constraint as an equality constraint. The reason is that we can always
replace $c_2$ by $\tilde{c}_2$ deterministically, via the graph of $g$. Namely, $\tilde{c}_2$ is defined so that
when $g_x(c_1(x), c_2(x)) = 1$, $\tilde{c}_2(x) = c_2(x)$ if $c_1(x) = c_2(x)$ and $\tilde{c}_2(x) = \neg c_2(x)$
otherwise. When $g_x(c_1(x), c_2(x)) = 0$ we define $\tilde{c}_2$ exactly in the opposite way,
yielding $\forall x,\ g_x(c_1, c_2) = [c_1 = \tilde{c}_2]$.
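
The reduction to an equality constraint can be checked mechanically. The sketch
below is illustrative only: the function names `tilde_c2` and `equality_form_holds`
are hypothetical, and the check is done pointwise, over all values that $g_x$, $c_1(x)$
and $c_2(x)$ can take at a single instance $x$.

```python
def tilde_c2(g_val: int, c1_val: int, c2_val: int) -> int:
    """Relabel c2 at one point so the constraint becomes an equality test.

    When g = 1: keep c2 if it already agrees with c1, otherwise negate it.
    When g = 0: do exactly the opposite.
    """
    agree = (c1_val == c2_val)
    if g_val == 1:
        return c2_val if agree else 1 - c2_val
    return 1 - c2_val if agree else c2_val


def equality_form_holds() -> bool:
    """Check g == [c1 == tilde_c2] for all eight combinations of (g, c1, c2)."""
    for g_val in (0, 1):
        for c1_val in (0, 1):
            for c2_val in (0, 1):
                relabeled = tilde_c2(g_val, c1_val, c2_val)
                if g_val != int(c1_val == relabeled):
                    return False
    return True
```

Calling `equality_form_holds()` confirms that, after relabeling, the constraint
value coincides with the indicator of agreement between $c_1$ and the relabeled concept.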

Assume the existence of a concept $c_2$, such that the learned hypothesis $h$
has a true error of $\epsilon_2$ w.r.t. $c_2$. Also, we assume that $c_2$ coheres with the target
function $c_1$ via $g$, that is:

$$P\{x \mid g_x(c_1, c_2)\} = \alpha. \qquad (3)$$

Consequently, we care about the performance of $h$ only under these constraints.
In the following discussion we assume that the outcomes of the concepts $c_1, c_2$
are independent given the outcome of the hypothesis $h$^1. In addition, we make
the technical assumption that the labels of $c_2$ are symmetric with respect to $h$,
namely, we assume that^2 $\Pr(c_2 = 0 \mid h = 1) = \Pr(c_2 = 1 \mid h = 0)$. As shown in
the next theorem, under these conditions, the hypothesis $h$ learned to approximate
$c_1$ actually achieves true error, relative to instances that are subject to
coherency constraints, that is smaller than $\epsilon_1$. Equivalently, in order to achieve
true error of $\epsilon_1$ one needs to train on fewer than the $m$ examples of Eqn. (2).

^1 Note that we assume only that $c_1, c_2$ are independent given $h$; in fact, they may be very
dependent. This is a reasonable assumption in many situations, e.g., those presented
in Sec. 1. Specifically, this is the situation in cases that involve multiple classifiers
(e.g., learning across modalities).

Theorem 1. Let $c_1$ be a target concept and $h$ a learned hypothesis that has true
error $\epsilon_1$ relative to it, based on Eqn. (2). Assume that $h$ has true error $\epsilon_2$ with
respect to $c_2$. Then, the true error of $h$ with respect to $c_1$ on the data satisfying
the constraint $g$ (Eqn. (3)), and under the conditions given, is given by

$$\epsilon = \frac{\epsilon_1 \epsilon_2\, \alpha}{(1 - \epsilon_1)(1 - \epsilon_2) + \epsilon_1 \epsilon_2} + \frac{\epsilon_1 (1 - \epsilon_2)(1 - \alpha)}{\epsilon_1 (1 - \epsilon_2) + (1 - \epsilon_1)\epsilon_2} \qquad (4)$$
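
As a sanity check on the expression in Theorem 1, the following sketch compares
the closed form with a value computed by enumerating an explicit joint
distribution over $(h, c_1, c_2)$. The function names and the parameter `q`
(the probability that $h = 1$) are assumptions made for illustration; the joint
distribution encodes the stated conditions, namely that $c_1$ and $c_2$ disagree
with $h$ independently given $h$, with the same probabilities for both values of $h$.

```python
from itertools import product


def constrained_error(e1: float, e2: float, alpha: float) -> float:
    """Closed form of Theorem 1 for the equality constraint."""
    p_agree = (1 - e1) * (1 - e2) + e1 * e2      # P(c1 = c2)
    p_differ = e1 * (1 - e2) + (1 - e1) * e2     # P(c1 != c2)
    return (e1 * e2 * alpha) / p_agree + (e1 * (1 - e2) * (1 - alpha)) / p_differ


def enumerated_error(e1: float, e2: float, alpha: float, q: float = 0.3) -> float:
    """Same quantity, computed from an explicit joint distribution.

    Assumption: P(h = 1) = q, and given h the events {c1 != h} and {c2 != h}
    are independent with probabilities e1 and e2.
    """
    err_and_agree = err_and_differ = p_agree = p_differ = 0.0
    for h, c1, c2 in product((0, 1), repeat=3):
        p = q if h == 1 else 1 - q
        p *= e1 if c1 != h else 1 - e1
        p *= e2 if c2 != h else 1 - e2
        if c1 == c2:
            p_agree += p
            if h != c1:
                err_and_agree += p
        else:
            p_differ += p
            if h != c1:
                err_and_differ += p
    # On the observed data the constraint c1 = c2 holds with probability alpha.
    return alpha * err_and_agree / p_agree + (1 - alpha) * err_and_differ / p_differ
```

Under these assumptions the two functions agree up to floating point error, and
for $\epsilon_2 < 0.5$ with $\alpha$ close to 1 the constrained error falls below $\epsilon_1$.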



The proof is given in Appendix 1. Note that 



Lemma 1. $\forall \epsilon_1$, if $(\epsilon_2 - 0.5)\bigl(\alpha - [(1 - \epsilon_1)(1 - \epsilon_2) + \epsilon_1 \epsilon_2]\bigr) < 0$ then the bound on
$\epsilon$ in Eqn. (4) satisfies $\epsilon < \epsilon_1$.

The lemma simply means that for values of $\alpha$ which actually constrain the
instances, $\epsilon < \epsilon_1$. The proof is by direct algebraic manipulation. Lemma 1
and Thm. 1 together show that for coherency constrained data, the true error
of $h$ w.r.t. $c_1$ is lower than the true error in general. Equivalently, one can
achieve the same generalization using a smaller number of training examples.
This reduction in sample complexity depends on (1) the degree ($\alpha$) of coherency
and (2) the performance ($\epsilon_2$) of (the $g$-map of) $h$ on $c_2$. An important point to
note is that we do not assume that the learning algorithm knows $c_2$. It is the
mere existence of this concept which makes the learning of $c_1$ easier. For the
special case of a deterministic constraint, when $\alpha = 1$, the number of examples
required for $h$ to have a true error of $\epsilon_1$ with respect to $c_1$ on the constrained
data is given by

$$\frac{\ln|\mathcal{H}| + \ln\frac{1}{\delta}}{\dfrac{\epsilon_1 (1 - \epsilon_2)}{\epsilon_1 (1 - \epsilon_2) + (1 - \epsilon_1)\epsilon_2}} \qquad (5)$$

It is straightforward to see that for $0 < \epsilon_2 < 0.5$, this is a better bound than the
one given in Eqn. (2).
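
To make the comparison concrete, the sketch below evaluates the finite-class
sample size $(\ln|\mathcal{H}| + \ln\frac{1}{\delta})/\epsilon$ and its counterpart for a deterministic
constraint ($\alpha = 1$), in which $\epsilon$ is replaced by
$\epsilon_1(1-\epsilon_2) / (\epsilon_1(1-\epsilon_2) + (1-\epsilon_1)\epsilon_2)$. The function names and the
example values are hypothetical and chosen only for illustration.

```python
import math


def pac_sample_size(h_size: int, delta: float, eps: float) -> float:
    """Finite hypothesis class bound: (ln|H| + ln(1/delta)) / eps."""
    return (math.log(h_size) + math.log(1.0 / delta)) / eps


def constrained_sample_size(h_size: int, delta: float, e1: float, e2: float) -> float:
    """Bound for a deterministic equality constraint (alpha = 1).

    The unconstrained error that suffices for a constrained error of e1 is
    e1(1 - e2) / (e1(1 - e2) + (1 - e1)e2), which exceeds e1 when e2 < 0.5.
    """
    effective_eps = e1 * (1 - e2) / (e1 * (1 - e2) + (1 - e1) * e2)
    return pac_sample_size(h_size, delta, effective_eps)


if __name__ == "__main__":
    # Illustrative values: |H| = 2^20, delta = 0.05, target error 0.05.
    for e2 in (0.1, 0.2, 0.4, 0.5):
        print(e2,
              round(pac_sample_size(2 ** 20, 0.05, 0.05)),
              round(constrained_sample_size(2 ** 20, 0.05, 0.05, e2)))
```

At $\epsilon_2 = 0.5$ the two bounds coincide, which matches the observation that
the constraint helps only when $h$ does better than chance on $c_2$.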

This case is similar to the one presented in [4]. They have introduced the
concept of co-training and have shown that the presence of unlabeled data can
help under some consistency assumptions. They consider the example of labeling
web pages, and it is argued that in this case two independent concepts exist which
provide consistent labels. Their example can be mapped to our framework which
then provides the guarantees missing in [4]. What is further highlighted by our
framework is the realization that the mere existence of the constraint makes the
original learning problem easier (even without using it).

^2 This can be relaxed; we can assume that these quantities are different and split the
region measured by $\epsilon_2$ above into two regions, one given $h = 1$ and the other given
$h = 0$; then we can define, for the proof, that $\Pr(h \neq c_2) = \max\{\Pr(c_2 = 0 \mid h = 1),
\Pr(c_2 = 1 \mid h = 0)\}$.

The importance of this sample complexity result is the following interpreta- 
tion of it. The presence of constraints reduces the number of training examples 
needed in order to achieve a certain generalization performance, relative to a 
constraint-free scenario. Stated differently, it directly addresses one of our con- 
cerns in the introduction: in these situations, one can believe the results of the 
learned predictor even though it was learned using a small number of examples. 
This can also be thought of as the PAC analysis of the case discussed in [11].

4 Equivalence Class Analysis 

In this section we relax one of the assumptions used in Sec. 3. Rather than
assuming a finite hypothesis space we consider the more general case in which the
hypothesis class is countably infinite and one assumes a probability distribution
$Q$ over it. For the standard learning model this case has been analyzed in [10]
and others. Consider the following example.

Example 1 Assume one is trying to learn a target function $c_1$ in the presence
of $c_2$ and that $c_2$ coheres with $c_1$ on the observed data. Let $\mathcal{H}$ be the class
of monotone Boolean conjunctions over $\{0,1\}^n$ and assume that $c_1, c_2$ are also
in $\mathcal{H}$. We can thus think of $c_1, c_2$ as elements in $\{0,1\}^n$ with the interpretation
that $c_i(j) = 1$ iff the conjunction $c_i$ contains the variable $x_j$ ($i = 1, 2$; $j = 1, \ldots, n$).
Assume now that $c_1$ differs from $c_2$ on $k$ bits. Given the coherency constraint
then, for all observed instances $x \in \{0,1\}^n$, the corresponding $k$ bits must be zero.
As a result, all functions in $\mathcal{H}$ which differ only on these bits are equivalent for
this learning problem. The size of the equivalence class will depend on the number
of bits $k$ on which $c_1, c_2$ differ.

The assumptions made in the above example can be relaxed in several ways. In 
particular, we can assume coherency with high probability and can still get a 
similar result in terms of an equivalence class with high probability. Some of the 
examples in m can also be analyzed via the equivalence class view. 

Assume that there is some probability distribution Q over the hypothesis 
space. An equivalence class over this hypothesis space would then mean that one 
can consider a smaller effective hypothesis space. Figure 1 shows the probability
distribution over the hypothesis class. Figure 2 maps the equivalence class view
over this hypothesis space. That is, all hypotheses belonging to an equivalence 
class are indistinguishable due to the presence of a constraint or, equivalently, 
due to a property of the data. The effective probability distribution assigns to 
the equivalence classes weights which are proportional to their size. Below we 
use the pac-Bayes framework to quantify this and show that this view indeed 
implies tighter generalization bounds. 

Fig. 1. Probability distribution over the hypothesis space

Fig. 2. Probability distribution over the hypothesis space. Dotted lines show the
equivalence class, i.e. all the hypotheses that fall in one group of the dotted lines are
indistinguishable on the given data and are assigned probability equivalent to the sum of
the probabilities of their equivalence class.

We assume a countably infinite hypothesis class $\mathcal{H}$ with known probability
distribution $Q$ over it and study the effects of coherency constraints. Let $C$ be the
class of hypotheses consistent with the data; $c \in C$. Using [10], the generalization
error is bounded by

$$\epsilon(c) \le \frac{\ln\frac{1}{Q(c)} + \ln\frac{1}{\delta}}{m} \qquad (6)$$



Let $\mathcal{H}_c \subseteq \mathcal{H}$, such that $\forall h \in \mathcal{H}_c$, the constraint is satisfied with high
probability. We can think of $\mathcal{H}_c$ as the class of hypotheses that are representatives
of the equivalence class, although the discussion that follows will apply to any
projection (filtering) of the hypothesis class. We are interested in solving for
$\epsilon(c \mid c \in \mathcal{H}_c)$, that is, the probability of a consistent hypothesis making an error
given that it satisfies the constraints. To do that we need $Q(c \in C \mid c \in \mathcal{H}_c)$.
(I.e., we restrict our learning algorithm to consider only those hypotheses which
satisfy some constraints.) This is a more general case than the one discussed
in the previous section. As discussed in P!. the constraints have the effect of 
reducing the size of the hypotheses space and this is what is observed here too. 
We first compute the term $Q(c \mid c \in \mathcal{H}_c)$.

$$Q(c \mid c \in \mathcal{H}_c) = \frac{Q(c, c \in \mathcal{H}_c)}{Q(c \in \mathcal{H}_c)} = \frac{Q(c)\, Q(c \in \mathcal{H}_c \mid c)}{Q(\mathcal{H}_c)} = \frac{\gamma\, Q(c)}{Q(\mathcal{H}_c)} \qquad (7)$$

where $\gamma$ is the probability that a consistent hypothesis belongs to the subset of
the hypotheses class satisfying the constraint. This leads to the following Lemma:

Lemma 2.

$$\epsilon(c \mid c \in \mathcal{H}_c) \le \frac{\ln\frac{1}{Q(c)} + \ln Q(\mathcal{H}_c) + \ln\frac{1}{\gamma} + \ln\frac{1}{\delta}}{m} \qquad (8)$$

Note that here we are considering a weaker constraint; only with high probability
does a hypothesis consistent with the data satisfy the constraint. This can also be
seen as modifying the probability distribution over the concept class which is
governed by the presence of constraints.
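
The effect of conditioning on the constrained subset can be illustrated
numerically. The sketch below assumes the unconstrained bound has the form
$(\ln\frac{1}{Q(c)} + \ln\frac{1}{\delta})/m$ and that the constrained bound adds the terms
$\ln Q(\mathcal{H}_c)$ and $\ln\frac{1}{\gamma}$ to the numerator; the function names and all numeric
values are hypothetical.

```python
import math


def pac_bayes_bound(q_c: float, delta: float, m: int) -> float:
    """Bound for a consistent hypothesis c with prior weight Q(c) = q_c."""
    return (math.log(1.0 / q_c) + math.log(1.0 / delta)) / m


def constrained_pac_bayes_bound(q_c: float, q_hc: float, gamma: float,
                                delta: float, m: int) -> float:
    """Bound conditioned on c lying in the constrained subset H_c.

    q_hc  : prior weight Q(H_c) of the constrained subset
    gamma : probability that a consistent hypothesis lies in H_c
    """
    numerator = (math.log(1.0 / q_c) + math.log(q_hc)
                 + math.log(1.0 / gamma) + math.log(1.0 / delta))
    return numerator / m


if __name__ == "__main__":
    # Illustrative values only.
    print(pac_bayes_bound(1e-6, 0.05, 1000))
    print(constrained_pac_bayes_bound(1e-6, 1e-3, 0.9, 0.05, 1000))
```

In this form the constrained bound is the tighter of the two exactly when
$Q(\mathcal{H}_c) < \gamma$, that is, when the constrained subset carries little prior weight
yet contains most of the consistent hypotheses.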



5 VC-Dimension Based Bounds 

In this section we consider the general case where the hypothesis class may
contain an infinite number of hypotheses. For the present analysis we will assume
that the class has finite VC-dimension. We first introduce the basic principles
of the VC theory and then develop related results under coherency constraints.
Finally we discuss applications of these to some realistic learning scenarios.

The VC-dimension based bounds can be intuitively viewed as extensions of
the bounds for the case of finite hypothesis class (Sec. 3), only that the size
of the hypothesis class is replaced by annealed entropy. The annealed entropy
is a distribution dependent concept which is then bounded from above by a
function of the VC-dimension (using Sauer’s lemma) and thus gives distribution
free bounds. The annealed entropy is given as

$$H_{ann} = \int \Delta^{\Lambda}(x^1, \ldots, x^L)\, dF(x) \qquad (9)$$

where $\Delta^{\Lambda}(x^1, x^2, \ldots, x^L)$ is the maximum number of dichotomies the sample
$x^1, x^2, \ldots, x^L$ can have when using a given set of hypotheses. The integral is
taken over all possible samples of size $L$ thus giving the expected number of
dichotomies that are possible for a given distribution over the data and a given
hypothesis class. Given the annealed entropy, the bound on the generalization
error is given by

$$P\Bigl\{\sup_{\alpha \in \Lambda} |R(\alpha) - R_{emp}(\alpha)| > \epsilon\Bigr\} \le 4 \exp\Bigl\{\Bigl(\frac{H_{ann}(2L)}{L} - \Bigl(\epsilon - \frac{1}{L}\Bigr)^2\Bigr) L\Bigr\}, \qquad (10)$$

where $R$ ($R_{emp}$) is the expected (empirical, resp.) risk associated with the target
function $\alpha$. The bound is developed in [14] (Theorem 4.1). It gives the explicit
dependence of the true error bound on the annealed entropy. The smaller the
annealed entropy, the tighter the bound is. This can also be thought of as the
capacity of the hypothesis class as a function of the distribution over the data.

Indeed, this is a much better bound than the most commonly used VC-dimension
bound. Consider a hypothesis class $\mathcal{H}$ which consists of all hyperplanes
in $\mathbb{R}^n$, and assume that the distribution that governs the data generation
supports only data points on the $x$-axis. For this case, while the VC-dimension
of the hypothesis class is directly related to the dimensionality of the space
($n + 1$) and is not affected by the distribution of the data, the annealed entropy
of $\mathcal{H}$ is independent of the space in which the data lies, yielding a much better
and more realistic bound. The VC-dimension can thus be thought of as a function
of annealed entropy for the worst case probability distribution on the data.
However, since, in general, computing the annealed entropy is not feasible, the
VC-dimension bound is commonly used.

Next we show that one can use limited information on the data distribution 
to obtain the effective annealed entropy. Our goal is related in spirit to the 
one studied in [13]. While they give a method of calculating the effective
VC-dimension given observed data, we instead use similar techniques to bound the
effective annealed entropy $H_{ann}$ within the coherency constraints framework.

5.1 A General Framework for Constraints 

This section develops a general framework for modeling the effect of the constraints
in terms of the effective annealed entropy; this can then be mapped to
the effective VC-dimension of the data. Recall the coherency constraints definition
(Def. 2). Denote

$$A = \{x \in X \mid g_x(c) = 1\}, \qquad \bar{A} = \neg A. \qquad (11)$$

This generalizes the discussion in the earlier sections. We assume that the constraint
is satisfied only by a particular labeling of the data for instances in $A$; however, for the
case when $x \in \bar{A}$, data can take any label.

Eqn. (9) can be written in terms of $A$, $\bar{A}$. The expected value is taken over
the space of all samples. Since one is looking at the number of dichotomies that
can be achieved for the given set of samples, this integral is over the $L$-fold
distribution. The annealed entropy can therefore be written as:

$$H_{ann} = \int_{A^L} \Delta^{\Lambda}(x^1, \ldots, x^L)\, dF(x) + \int_{\overline{A^L}} \Delta^{\Lambda}(x^1, \ldots, x^L)\, dF(x) \qquad (12)$$

Denote $P(x \in A) = \alpha$, the probability that a sample satisfies the constraint.
The probability that not all the samples came from $A$ is $1 - \alpha^L$. When all the
samples are from $A$, then only one labeling of samples is possible; otherwise a
large number of dichotomies are possible. We get the following bound on the
effective annealed entropy:

$$H^{eff}_{ann} \le (1 - \alpha^L)\, \Delta^{\Lambda}_{\max}(L) \qquad (13)$$

where $\Delta^{\Lambda}_{\max}(L)$ gives the maximum number of dichotomies of any set of $L$ samples
using the hypotheses from the given class. This bound gives a much smaller
value of the effective annealed entropy for small values of $L$. However, for large
values of $L$, $\alpha^L$ goes to zero as does the effect of constraints (in the formulation
given). Thus, as the number of samples grows, the effect of the constraints, as
played in the simple argument above, on the generalization performance goes
down. To understand this, note that, intuitively, with an infinite amount of data,
if the constraints affect only a small portion of the instance space ($\alpha$ is small)
the number of “observed” dichotomies will be almost as large as the number of
possible dichotomies. However, the more interesting case might be that of small
values of $L$, and of large $\alpha$. Note that in the natural examples alluded to earlier
in the paper, $\alpha$ is large. Also, this analysis is still very general and makes no
assumptions on the structure of the constraints. Recall, for example, the case
discussed at the beginning of this section, where all the instances lie on the
$x$-axis. It would be desirable to exploit general structural constraints to explicitly
bound the annealed entropy. These ideas are exemplified in two concrete cases.

5.2 Highly Biased Class Probability 

In many realistic learning problems the probability of observing positive examples
is very small relative to that of the negative examples. Consider the problem
of face detection in Computer Vision. One may see only a few tens of positive
examples and may see millions of negative examples. Similar phenomena occur
in many natural language and information extraction learning situations. The
considerations developed earlier can be used to show that the generalization
performance in these cases is better than predicted by current theories. To show
that, we will compute the effective annealed entropy.

Denote by $\alpha$ the probability of the positive class; $1 - \alpha$ is the probability
of the negative class. Without loss of generality, we will assume that $\alpha \ll 1$.
To model the highly biased class probability as a coherency constraint, one can
think of the equality constraint $g(c)$ with $c_1$ as the target function and $c_2(x) = 0$.
We assume that this constraint holds with high probability ($1 - \alpha$). Using
the analysis given for Eqn. (13) we obtain:

Corollary 1. Assume a highly biased class probability case, with the probability
of the positive class being $\alpha \ll 1$. Then the effective annealed entropy for a data
set of $L$ samples is

$$H^{eff}_{ann} \le (1 - (1 - \alpha)^L)\, H_{ann} \qquad (14)$$

where $H_{ann}$ is the annealed entropy (no assumptions).

For small values of $L$, we see that $H^{eff}_{ann} \ll H_{ann}$. Although as $L \to \infty$,
$H^{eff}_{ann} \to H_{ann}$, we argue that the interesting case is when $L$ is not too large, since

$$\lim_{L \to \infty} \frac{H_{ann}}{L} = 0 \qquad (15)$$

(This is a simple consequence of uniform convergence as the number of samples
observed approaches infinity.)
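
To get a feel for the sample sizes at which the constraint matters, the following
sketch tabulates the multiplicative factor $1 - (1 - \alpha)^L$, assuming the
corollary's bound has the form of this factor times the unconstrained annealed
entropy. The function name and the chosen values of $\alpha$ and $L$ are illustrative
assumptions.

```python
def biased_class_factor(alpha: float, num_samples: int) -> float:
    """Factor 1 - (1 - alpha)^L that scales the unconstrained annealed entropy.

    alpha       : probability of the (rare) positive class
    num_samples : sample size L
    """
    return 1.0 - (1.0 - alpha) ** num_samples


if __name__ == "__main__":
    alpha = 0.001  # illustrative: one positive example per thousand
    for num_samples in (10, 100, 1000, 10000, 100000):
        print(num_samples, biased_class_factor(alpha, num_samples))
```

The factor stays well below one while $L$ is small relative to $1/\alpha$ and
approaches one as $L$ grows, in line with the discussion above.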

We note that in this case one can observe the effect of the constraints not
only as a consequence of the smaller effective annealed entropy but also directly
by looking more closely at the form of the Chernoff bound. In general, the binary
classification problem is modeled as the convergence of the observed frequencies,
in a Bernoulli experiment with mean $p$, to the true frequencies. The standard
formulation used for the bound is that of the Hoeffding bound:

$$P(S > (p + \epsilon)m) \le e^{-2m\epsilon^2}$$

which gives a bound which is independent of $p$. However, an exact analysis of
the Chernoff bound for the Bernoulli case results in the tighter bound:

Lemma 3.

$$P(S > (p + \epsilon)m) \le e^{-m\left[(p + \epsilon)\log\frac{p + \epsilon}{p} + (1 - \epsilon - p)\log\frac{1 - \epsilon - p}{1 - p}\right]}$$

This can be easily verified using the standard definition of the Chernoff bound.
Fig. 3(a) compares the standard bound given above to the tighter one given by
Lemma 3 as a function of the class probability. It is evident that the bound is
significantly better for small values of $p$.
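
The gap between the two bounds can be reproduced numerically. The sketch below
assumes the $p$-independent bound is the usual Hoeffding form $e^{-2m\epsilon^2}$ and that
the tighter bound is the relative entropy (Kullback-Leibler) form of the Chernoff
bound for Bernoulli variables; the function names and the example values are
illustrative.

```python
import math


def hoeffding_bound(eps: float, m: int) -> float:
    """Bound on P(S > (p + eps) m) that does not depend on p."""
    return math.exp(-2.0 * m * eps * eps)


def kl_bernoulli(a: float, b: float) -> float:
    """KL divergence between Bernoulli(a) and Bernoulli(b), natural log.

    Requires 0 < a < 1 and 0 < b < 1.
    """
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def chernoff_kl_bound(p: float, eps: float, m: int) -> float:
    """Chernoff bound exp(-m * KL(p + eps || p)) for a Bernoulli(p) sum."""
    return math.exp(-m * kl_bernoulli(p + eps, p))


if __name__ == "__main__":
    eps, m = 0.05, 100  # illustrative values
    for p in (0.01, 0.05, 0.1, 0.3, 0.5):
        print(p, hoeffding_bound(eps, m), chernoff_kl_bound(p, eps, m))
```

Since the relative entropy is at least $2\epsilon^2$, the second bound is never worse
than the first, and the improvement is largest when $p$ is small.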




(a) (b) 



Fig. 3. (a) The dotted line gives the bound for the standard Chernoff bound; the solid 
line is the tighter version of the bound, (b) Learning Curve for the noiseless case. The 
dotted curve shows the learning case in the presence of the constraint and the solid 
lines show the learning curves of the individual classifiers in the absence of constraints. 



5.3 Linear Mapping to Higher Dimensional Space 

As a second example consider learning a linear classifier for data lying in a high
dimensional space (say $M$). Due to the high dimensionality it is very likely that
the training data is linearly separable. In fact, in many natural language and
visual learning problems the dimensionality is larger than the number of training
instances. The basic question is to understand the generalization properties of
the resulting classifier, given that it was learned based on a small number of
examples relative to the dimensionality.

For simplicity, we assume that there exists a one-to-one mapping of the $M$
dimensional data to a lower dimensional space, $N$, through a linear transformation
(e.g., think of a case in which the data is originally in a lower dimensional
space $N$ but is being observed in a higher dimensional space).
In this case, we show that the data is linearly separable in the $N$ dimensional
space and the generalization performance is thus governed by it. Notice that even in this
case, the problem of recovering the transformation matrix and using it to map
the data back to the lower dimensional space is intractable. However, our claim
is that this is not necessary and learning in the high dimensional space does
not require seeing more data. To see that, let $x = (x_1, x_2, \ldots, x_M)$ be a training
example and $h = (h_1, h_2, \ldots, h_M)$ the linear classifier in the higher dimensional
space. Denote by $z = (z_1, z_2, \ldots, z_N)$ the data point in the $N$ dimensional space
such that $x = Az$ where $A$ is the (unknown) $M \times N$ transformation matrix. The
outcome of the classifier is $y = h^T x$, which can also be written as

$$y = h^T x = h^T A z = (A^T h)^T z = \tilde{h}^T z$$

That is, there exists a linear classifier $\tilde{h} = A^T h$ in the lower dimensional space that will
achieve the same performance as $h$ in the higher dimensional space. The idea
is that one doesn’t need to know either $A$ or $\tilde{h}$. This scenario is also directly
representable in the coherency constraint framework, as in Def. 2. To do that,
let $c_1 = h$, the target function in the $M$ dimensional space, $c_2 = \tilde{h} \circ A^{-1}$, and
let the constraint be the equality constraint. Clearly, this simple scenario can
be relaxed to the full generality of the definition, but the outcome is essentially
the same. That is, there is no need to recover the transformation, but rather the
fact that it exists implies that the generalization properties are as good as they
could be in the lower dimensional case. The constraints can also be used directly
to show that the VC-dimension in this case is actually $N + 1$ and not $M + 1$ as
was originally thought.
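
The identity behind this argument, that a classifier $h$ applied to $x = Az$ gives
the same output as $A^T h$ applied to $z$, can be checked numerically. The sketch
below uses only the Python standard library; the function names, the dimensions
and the Gaussian sampling are assumptions made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def apply_matrix(a, z):
    """Compute x = A z for A given as a list of M rows of length N."""
    return [dot(row, z) for row in a]


def project_classifier(a, h):
    """Compute A^T h, the classifier induced in the N-dimensional space."""
    n_dim = len(a[0])
    return [sum(a[i][j] * h[i] for i in range(len(a))) for j in range(n_dim)]


def max_output_gap(m_dim=50, n_dim=5, trials=100, seed=0):
    """Largest |h.x - (A^T h).z| over random points z with x = A z."""
    rng = random.Random(seed)
    a = [[rng.gauss(0, 1) for _ in range(n_dim)] for _ in range(m_dim)]
    h = [rng.gauss(0, 1) for _ in range(m_dim)]
    h_low = project_classifier(a, h)
    gap = 0.0
    for _ in range(trials):
        z = [rng.gauss(0, 1) for _ in range(n_dim)]
        x = apply_matrix(a, z)
        gap = max(gap, abs(dot(h, x) - dot(h_low, z)))
    return gap
```

The returned gap is zero up to floating point error, so the two classifiers
produce identical labels on every point of the embedded data even though neither
$A$ nor the low dimensional classifier is known to the learner.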

Recent work on random projection also makes use of the same idea and a
number of tighter bounds have been proposed [2]. As has been pointed out in [11],
coherency constraints may also imply increased margin when the classification
is done using a linear hyperplane. Applying ideas from the theory of Random
Projection, this implies that one can project the data to a lower dimension,
without compromising the performance (with high probability). Our recent work [8]
makes use of these ideas to develop tighter bounds by analyzing the data in this
reduced space.

6 Experiments 

We describe some preliminary experiments used to exhibit and evaluate the 
implications of the insights gained in this work. We considered the problem 
of learning a half space in the presence of another, constraining, half space. 
Specifically, data was sampled from an $n$ dimensional space, but the (randomly
chosen) classifiers $c_1, c_2$ actually depend only on $n/2$ dimensions: $c_1$ depends
on $x_1, \ldots, x_{n/2}$ and $c_2$ on $x_{n/2+1}, \ldots, x_n$. We show learning curves for learning
$c_1$ given data sampled uniformly from $\mathbb{R}^n$, and also of learning $c_1$ when the
data observed is filtered to satisfy the equality constraint, that is, for all input
instances $x$, $c_1(x) = c_2(x)$. (For completeness we also show the curves for $c_2$.)
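
The data generation step of such an experiment can be sketched as follows. This
is not the authors' code: the function names are hypothetical, the half spaces
pass through the origin, and "uniform" sampling is taken over the cube
$[-1, 1]^n$, all of which are assumptions made for illustration.

```python
import random


def make_halfspace(weights, lo, hi):
    """Return a 0/1 classifier that looks only at coordinates lo..hi-1."""
    def classify(x):
        score = sum(w * v for w, v in zip(weights, x[lo:hi]))
        return 1 if score > 0 else 0
    return classify


def sample_coherent(n, count, seed=0):
    """Draw `count` labeled instances that satisfy c1(x) == c2(x).

    c1 depends on the first n // 2 coordinates and c2 on the remaining ones.
    Instances violating the equality constraint are rejected.
    """
    rng = random.Random(seed)
    half = n // 2
    c1 = make_halfspace([rng.gauss(0, 1) for _ in range(half)], 0, half)
    c2 = make_halfspace([rng.gauss(0, 1) for _ in range(n - half)], half, n)
    kept = []
    while len(kept) < count:
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if c1(x) == c2(x):              # equality constraint
            kept.append((x, c1(x)))     # label according to the target c1
    return kept, c1, c2
```

Any learner for half spaces can then be trained on the filtered sample and on an
unfiltered one of the same size to compare learning curves; the learner itself
never sees $c_2$ or the filtering rule.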

Figure 3(b) shows the learning rate, for the noise free case, with and without
the constraints. We used data in 3?^^), and tested on 1000 examples. The curves 
give the errors as a function of the number of training examples; the solid curves
are for the individual half-spaces and the dot-dash curve is for learning in the presence
of the equality constraint.




Fig. 4. Learning Curve for the noisy case. The dotted curve shows the learning case 
in the presence of the constraint and the solid lines show the learning curves of the 
individual classifiers in the absence of constraints. 



Fig. 4 depicts the results of the same experiment for noisy data (this time,
data lied in 3?^°). It is clearly evident from the learning curves shown that the 
classifier is able to learn much faster in the presence of the constraint. As we
have pointed out throughout the paper, the learning algorithm is unaware of the
existence or form of the constraints.

7 Conclusions 

The power of existing models of learning [12,14] stems from the distribution-free
nature of the model. The underlying assumption is that the probability
distribution governing the occurrences of instances is too complex to model and 
a theory should be developed without making explicit assumptions on it. The 
resulting theories, however, cannot explain well a wide range of phenomena, in 
which learning can be done robustly from a relatively small number of examples. 

In this work we have developed a learning model within which we attempt
to explain these phenomena. The key observation underlying this model is that,
in many situations, learning problems do not occur in isolation. Our model is
therefore concerned with learning scenarios where multiple learners co-exist but
there are mutual coherency constraints on their outcomes. Within this model,
we have developed generalization bounds and have shown that in the presence of
coherency constraints the learning problem indeed becomes easier than predicted
by the general theories. This could explain the ability to generalize well from a
fairly small number of examples and can help in understanding several realistic
learning situations.

While several works (e.g., [1]) have criticized the distribution free pac learning
model as being too restrictive, this work still pursues the distribution free
approach (see discussion in Sec. 2). In some sense, our model can be viewed as an
intermediate model between the worst case distribution free model that is com- 
monly studied in learning theory and the simpler, but unrealistic, distribution 
specific model (in which one assumes a complete knowledge of the distribution, 
and can utilize it when learning). We assume, instead, a distribution free model 
in which the distribution could be constrained in natural, but intricate ways. 
The learner is unaware of this. This view opens up a number of questions; in 
particular, an interesting direction could be to understand generalization under 
specific families of constraints. 

References 

1. J. Amsterdam. Some philosophical problems with formal learning theory. In
National Conference on Artificial Intelligence, pages 580-584, 1988.

2. R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts
and random projection. In Proc. of the 40th Foundations of Computer Science,
1999.

3. G. Benedek and A. Itai. Learnability with respect to fixed distributions. Theoret. 
Comput. Sci., 86(2):377-389, 1991. 

4. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. 
In Proc. of the Annual ACM Workshop on Computational Learning Theory, pages 
92-100, 1998. 

5. A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. 
Information Processing Letters, 24:377-380, April 1987. 

6. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 
1995. 

7. H. Drucker, R. Schapire, and P. Simard. Improving performance in neural networks
using a boosting algorithm. In Neural Information Processing Systems 5, pages
42-49. Morgan Kaufmann, 1993.

8. A. Garg, S. Har-Peled, and D. Roth. Generalization bounds for linear learning 
algorithms. Technical Report UIUCDCS-R-2001-2232, University of Illinois at 
Urbana Champaign, June 2001. 

9. A. R. Golding and D. Roth. A Winnow based approach to context-sensitive spelling
correction. Machine Learning, 34(1-3):107-130, 1999. Special Issue on Machine
Learning and Natural Language.

10. D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355- 
363, 1999. 

11. D. Roth and D. Zelenko. Towards a theory of coherent concepts. In National 
Conference on Artificial Intelligence, pages 639-644, 2000. 

12. L. G. Valiant. A theory of the learnable. Communications of the ACM, 
27(11):1134-1142, November 1984. 

13. V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC-dimension of a learning 
machine. Neural Computation, 6(5):851-876, 1994. 

14. V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc., New York,
1998.



150 A. Garg and D. Roth 



8 Appendix 

Proof (Proof of Thm.\^. Given the discussion, before the beginning of the the- 
orem, it is sufficient to prove the theorem for the case of g being the equality 
constraint. We denote by $G$ the constraint set, i.e., $x \in G$ implies that the sample 
$x$ follows the constraint. Also denote $C = \{x \mid c_1(x) = c_2(x)\}$ and $\neg C$ its 
complement. Similarly, $H_i = \{x \mid h(x) = c_i(x)\}$. 

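The true error of $h$ w.r.t. $c_1$ on a sample satisfying the constraint is given by 

$$
\begin{aligned}
P(h(x) \neq c_1(x) \mid x \in G) &= P(\neg H_1 \mid x \in G) \\
&= P(\neg H_1, C \mid x \in G) + P(\neg H_1, \neg C \mid x \in G) \\
&= P(\neg H_1 \mid C, x \in G)\,P(C \mid x \in G) + P(\neg H_1 \mid \neg C, x \in G)\,P(\neg C \mid x \in G) \\
&= P(\neg H_1 \mid C)\,P(C \mid x \in G) + P(\neg H_1 \mid \neg C)\,P(\neg C \mid x \in G) \\
&= \frac{P(\neg H_1, C)\,\alpha}{P(C)} + \frac{P(\neg H_1, \neg C)\,(1-\alpha)}{P(\neg C)} \\
&= \frac{P(\neg H_1, \neg H_2)\,\alpha}{P(C)} + \frac{P(\neg H_1, H_2)\,(1-\alpha)}{P(\neg C)} \\
&= \frac{\epsilon_1 \epsilon_2 \alpha}{(1-\epsilon_1)(1-\epsilon_2) + \epsilon_1 \epsilon_2}
 + \frac{\epsilon_1 (1-\epsilon_2)(1-\alpha)}{(1-\epsilon_1)\epsilon_2 + \epsilon_1 (1-\epsilon_2)}.
\end{aligned}
$$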

The fourth equality follows from the fact that, conditioned upon $C$ or $\neg C$, 
whether $h(x)$ agrees with $c_1(x)$ or not is independent of whether $x \in G$. The 
reason is that the effect of the constraint is simply in determining the probability 
of $C$. The sixth equality is due to set equality (e.g., $\neg H_1 \cap C = \neg H_1 \cap \neg H_2$). 
The seventh equality uses the fact that decisions made by concepts $c_1, c_2$ are 
independent of each other given $h$. To see it more specifically, one has to go 
through a series of probabilistic equalities. Let $H$ refer to the set of all $x$ 
such that $h(x) = 1$ and $\neg H$ refer to the set of $x$ such that $h(x) = 0$. 



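$$
\begin{aligned}
P(\neg H_1, \neg H_2)
&= P(c_1(x) \neq h(x),\, c_2(x) \neq h(x) \mid H)\,P(H) + P(c_1(x) \neq h(x),\, c_2(x) \neq h(x) \mid \neg H)\,P(\neg H) \\
&= P(c_1(x) \neq h(x) \mid H)\,P(c_2(x) \neq h(x) \mid H)\,P(H) + P(c_1(x) \neq h(x) \mid \neg H)\,P(c_2(x) \neq h(x) \mid \neg H)\,P(\neg H) \\
&= P(c_2(x) \neq h(x) \mid H)\,\bigl\{ P(c_1(x) \neq h(x) \mid H)\,P(H) + P(c_1(x) \neq h(x) \mid \neg H)\,P(\neg H) \bigr\} \\
&= \epsilon_1 \epsilon_2
\end{aligned}
$$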



where we have used the fact that $c_2$ is a symmetric concept with respect to 
$h$ (i.e., $P(c_2(x) = 1 \mid h(x) = 0) = P(c_2(x) = 0 \mid h(x) = 1)$), which means that one 
can write $P(c_2(x) = 1 \mid h(x) = 0) = \epsilon_2$. Using the same argument, one can derive 
the other two terms, $P(C)$ and $P(\neg H_1, H_2)$. The analysis of the latter 
follows exactly the same steps as given above. In the case of $P(C)$, one 
can write it as a summation of four terms (two conditioned upon $h(x) = 1$ and 
the other two conditioned upon $h(x) = 0$) and then follow the same argument as 
above. We have also used the fact that $P(x : c_1(x) = c_2(x) \mid x \in G) = \alpha$ (Eqn.jS]). 
This proves the theorem. 
Abstract. In most concept learning problems considered so far by the 
learning theory community, the instances are labeled by a single unknown 
target. However, in some situations, although the target concept may be 
quite complex when expressed as a function of the attribute values of 
the instance, it may have a simple relationship with some intermediate 
(yet to be learned) concepts. In such cases, it may be advantageous to 
learn both these intermediate concepts and the target concept in parallel, 
and use the intermediate concepts to enhance our approximation of the 
target concept. 

In this paper, we consider the problem of learning multiple interrelated 
concepts simultaneously. To avoid stability problems, we assume that 
the dependency relations among the concepts are not cyclical and 
hence can be expressed using a directed acyclic graph (not known to 
the learner). We investigate this learning problem in various popular 
theoretical models: the mistake bound model, the exact learning model and 
the probably approximately correct (PAC) model. 

Keywords: multiple concepts, mistake bound algorithm, exact learning, 
PAC learning, membership queries 



1 Motivation 

In a typical concept learning problem, an instance $x = (x_1, \ldots, x_n)$ is classified 
according to a single unknown target concept $f$. The learner’s task is to find a 
good hypothesis $h$ that approximates $f$. In some practical situations, the target 
concept may be very complex when expressed as a function of only the attribute 
values of the instance. However, it may be expressible in a simpler form using 
some intermediate concepts in addition to the attributes of the instance. 

As a pedagogical example, suppose you want to predict tomorrow’s perceived 
temperature, which can be different from the air temperature depending on to- 
morrow’s humidity, wind and other factors. However, you can only measure the 
current weather indicators. It may be difficult to predict the perceived tempera- 
ture directly from these indicators as they may bear a complex relationship with 

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. ISl-jlOO] 2001. 
the future perceived temperature. However, the prediction task may be made 
easier if we were to learn to predict tomorrow’s humidity, wind pattern and 
other factors (which may have a simpler relationship with the current weather 
indicators), and then use them to predict tomorrow’s perceived temperature. 
In such a scenario, it may be advantageous to also learn the intermediate concepts 
(various weather indicators) and use these intermediate predictions to enhance 
our approximation of the target concept (perceived temperature). Similarly, it 
may be easier to predict the humidity level if we can also learn to predict wind speed 
and direction, air pressure and other indicators. 

Another example, found in [Car98], is the problem of predicting mortality risk 
from diseases like pneumonia. The goal here is to identify patients accurately 
and economically so as to decide whether they need to be hospitalized to receive 
aggressive treatment. Note that the goal is not to diagnose pneumonia, as the 
diagnosis has already been made. Here, we (or rather the Home Managed Care 
Organization) are interested in determining how much risk the illness poses to the 
patient and whether the cost of aggressive treatment is necessary. Unfortunately, 
the task is complicated by the fact that many of the useful but expensive 
tests (like white blood cell count, hematocrit test, potassium count, etc.) for 
predicting pneumonia risk are performed only after one is hospitalized. It would 
be more cost effective if we could separate the low-risk patients from the moderate 
and high-risk patients through the cheaper measurements (like age, sex, diabetic, 
asthmatic, chest pain, wheezing, blood pressure, temperature, heart murmur, etc.) 
made prior to admission to hospital. One possible way of enhancing pneumonia 
mortality rate prediction is to approximate the expensive test results using the 
initial low-cost measurements, and then learn how the mortality rate depends on 
both the test results and the initial measurements. Further plausible applications in 
the medical and image processing domains can also be found in 
[Car96, CPT97, Car98, DHB95, SK90, SH91]. 

In this paper, we study how to exploit these intermediately related concepts 
to enhance prediction accuracy. Although the problem of learning multiple 
related concepts simultaneously has not been well investigated in the learning 
theory community, the problem has been extensively studied empirically in the 
neural network community [Car96, CPT97, Car98, DHB95, SK90, SH91]. Instead of 
having a neural net with a single output node learning a single concept, these 
empirical studies showed that better results can be achieved by having a 
neural net with multiple output nodes, each trying to learn a different, but closely 
related, concept. Subsequently, Baxter [Bax95, Bax97] provided theoretical 
justifications of this phenomenon. The main theme of this type of research is that by 
learning multiple closely related concepts simultaneously, the learner is better 
at constructing useful features (the activation functions of the hidden nodes). 
Our work differs in that these earlier results did not assume that 
there is a dependency relationship among the concepts. Further, in our case, we 
assume the values of the interrelated concepts are specified in the label. 
2 The Intermediate Concepts Learning Model 

To make the learning problem more general and easier to analyze, we assume 
that the learner’s task is to learn a collection of concepts $\mathcal{F} = \{f_1, \ldots, f_k\}$, 
instead of a single target concept. As in standard concept learning, we assume 
that each concept in $\mathcal{F}$ comes from a given concept class $C$. Further, we assume 
that each concept in $\mathcal{F}$ may depend on the other concepts in $\mathcal{F}$, in addition 
to the attributes. To avoid circular dependency, we also assume that there is 
a linear ordering $f_{j_1}, \ldots, f_{j_k}$ of $\mathcal{F}$ such that each concept $f_{j_i}$ is defined by the 
attributes $X = \{x_1, \ldots, x_n\}$ of the instance and $\{f_{j_1}(x), \ldots, f_{j_{i-1}}(x)\}$, but 
not $\{f_{j_{i+1}}(x), \ldots, f_{j_k}(x)\}$. However, the learner does not know which are the 
intermediate concepts in $\mathcal{F}$ that each $f_i$ is dependent on. In other words, the 
learner does not know the (linear) ordering $f_{j_1}, \ldots, f_{j_k}$. 

Often a concept class that is not very expressive can be used to represent 
complex concepts via this notion of intermediate concepts. For example, a DNF 
formula can be viewed as a disjunction of some intermediate concepts where 
each intermediate concept is a conjunction of some attributes of the instance. In 
our learning setting, we are learning the class of disjunctions union conjunctions. 
Here, we are assuming that the label of an instance x contains the values of the 
individual terms in addition to the value of the target DNF. Therefore, although 
Theorem (see Section implies that disjunctions union conjunctions can be 
learned efficiently (in the mistake bound model), it does not mean that DNF 
can be learned efficiently. Theorems 7, 8 and 9 provide further examples of how 
a complex concept can be represented as a simple concept using a collection of 
intermediate concepts. 

In this model, the intermediate concepts are treated as Boolean values in 
those concepts that they affect. That is, in measuring the complexity (i.e. length) 
of a concept represented in terms of both attributes and intermediate concepts, 
we do not expand the representations of the intermediate concepts. The latter 
would defeat the motivation of this paper. Further, we assume that the 
intermediate concepts are somewhat ‘related’^, which in practice entails that the size of 
$\mathcal{F}$ is polynomially bounded in the number of base attributes in $X$. 

The relationship among the concepts can be described by a dependency DAG 
$G(X, \mathcal{F})$ whose nodes are labeled using $X \cup \mathcal{F}$, where $X$ is the set of attributes 
$\{x_1, \ldots, x_n\}$. There is an arc $(a, b)$, $a \in X \cup \mathcal{F}$, $b \in \mathcal{F}$, in $G(X, \mathcal{F})$ if and only if the 
value of $b$ is directly dependent on $a$. Clearly, the graph does not contain any cycle 
due to the ordering of $\mathcal{F}$. The depth of $G$ is the length of the longest directed path 
in $G(X, \mathcal{F})$. The level $l(f_i)$ of a function $f_i$ is the length of the longest directed 
path in $G(X, \mathcal{F})$ that ends in $f_i$. Although in practice a learner is typically 
interested in learning a subset of $\mathcal{F}$, for the sake of simplicity we assume that 
the learner is interested in learning all the functions in $\mathcal{F}$. We consider this 
intermediate concepts learning (ICL) problem under various popular concept 
learning processes. 

^ The notion of relatedness is somewhat subjective. For further discussion on this 
issue, we refer the reader to Caruana [Car98]. 
Littlestone’s [Lit88] mistake bound process: In this process, learning 
proceeds in a sequence of trials. In each trial, the learner is given an instance 
$x = (x_1, \ldots, x_n)$. The learner’s task is to predict the classification 
of $x$, according to some target function $f$ in a given concept class $C$. In 
return, the learner receives feedback on whether the prediction is a 
mistake. For the intermediate concepts learning problem, instead of predicting 
a single concept, the learner has to predict the values of all the functions in 
$\mathcal{F}$ on $x$. The objective is to make as few mistakes as possible in any, possibly 
infinite, sequence of trials. When some of the intermediate predictions are 
wrong in a trial, we count it as one mistake (not multiple mistakes). 

Valiant’s [Val84] probably approximately correct (PAC) process: 

In the single target concept version of the PAC model, the learner’s goal is 
to infer an unknown target concept $c$ from some class of concepts $C$. To 
obtain information about $c$, the learning algorithm is provided access to 
labeled (positive and negative) examples of $c$, drawn randomly according to 
some unknown probability distribution $D$ over the instance space $X$. The 
learner is also given two parameters $\epsilon$ and $\delta$ as input. The learner’s goal is 
to output, with probability at least $1 - \delta$, the description of a concept $d$ 
that has probability at most $\epsilon$ of disagreeing with $c$ on a randomly drawn 
example from $D$ (thus, $d$ has error at most $\epsilon$). If such a learning algorithm 
$A$ exists (that is, an algorithm $A$ meeting the goal for any target concept $c$, 
any target distribution $D$ and any $\epsilon, \delta > 0$), we say that $C$ is PAC-learnable. 
We say that a PAC learning algorithm is a polynomial-time (or efficient) 
algorithm if the number of examples drawn and computation time are 
polynomial in $n$, $1/\epsilon$, $1/\delta$, and perhaps other natural parameters. 

We extend the standard PAC model to the ICL environment by requiring 
the learner to output, with probability at least $1 - \delta$, a set of hypotheses 
$\{h_1, \ldots, h_k\}$ such that the probability that at 
least one of the hypotheses $h_i$ disagrees with $f_i$ on a randomly drawn example 
from $D$ is at most $\epsilon$. 

Angluin’s [Ang88] exact learning process: Here, the learner is given 
access to some oracles and the learner’s task is to exactly identify the target 
concept $f$. The two most commonly used oracles are equivalence and 
membership query oracles. In a single concept learning environment, an 
equivalence query oracle takes a hypothesis $h$ from some hypothesis class $H$ as 
input. If $h$ is not equivalent to $f$, then it returns a labeled counterexample 
which $h$ and $f$ classify differently; otherwise it returns yes. If the hypothesis 
class $H$ and the concept class $C$ are the same, then the equivalence query 
oracle is said to be proper, and improper otherwise. A membership query 
oracle takes an arbitrary instance $x$ of the learner’s choosing as input and 
returns its classification $f(x)$. A concept class is said to be efficiently 
learnable in the exact model with the given oracles if the target concept can be 
identified in time (and hence with a number of queries) polynomial in 
$n$, the size of the target concept and possibly other natural parameters. For 
the ICL problem, the learner is supposed to identify all the concepts in $\mathcal{F}$. 
Here, a membership query oracle takes an instance $x$ as input and returns 
$(f_1(x), \ldots, f_k(x))$. 

In the above adaptations of the traditional models, we count multiple errors 
in predicting $f_1(x), \ldots, f_k(x)$ as one error for the sake of making the analysis 
simple. Some of the bounds obtained in this paper also apply when we 
count these errors not as a single error but as multiple errors. Since single concept 
learning is a restricted case of multiple concepts learning, we have the following 
trivial observation. 

Observation 1 Efficient learning of intermediate concepts in a concept class 
C simultaneously implies efficient learning of C in the standard single concept 
learning setting. 

The main theme of this paper is to show that the converse of Observation 1 
holds for the three models when membership queries are not allowed. However, 
Theorems 7, 8 and 9 in Section 4 show that the converse may not be true when 
membership queries are allowed. Some ideas from this theoretical research have 
been implemented and studied empirically using synthetic data. The results seem 
to be promising. We plan to run our experiments on other standard benchmark 
data sets and applications found in [Car96, CPT97, Car98, DHB95, SK90, SH91]. 
Another plan is to adapt our algorithms to deal with missing attribute values 
and perform experiments on the UC Irvine database. 



3 Efficient Algorithms for Mistake Bound Learning 



For the online mistake bound model, we have the following results. 

Theorem 2. Suppose there is an online algorithm $A$ that learns a single target 
concept from some concept class $C$ using some monotone hypothesis class $C$ with 
mistake bound $M$. Then there is an algorithm that learns multiple intermediate 
concepts $\mathcal{F} = \{f_1, \ldots, f_k\} \subseteq C$ with mistake bound $kM$. 

Proof: The learner runs one copy of $A$, denoted here as $A_i$, to learn each 
function $f_i$ in $\mathcal{F}$. We denote the hypothesis maintained by $A_i$ by $h_i$. The learner 
assumes that $f_i(x)$ depends on $X \cup \mathcal{F}(x) \setminus \{f_i(x)\}$. 

To make a classification on an instance $x = (x_1, \ldots, x_n)$, we assume 
initially that each $f_i(x)$ is ‘0’. To avoid confusion with the actual value of $f_i(x)$, 
let us denote this assumed value as $y_i$. With these assumed values, if each 
$h_i(x_1, \ldots, x_n, y_1, \ldots, y_k)$ is the same as its corresponding assumed value $y_i$, then 
the learner stops and predicts $h_i(x_1, \ldots, x_n, y_1, \ldots, y_k)$, $i = 1, \ldots, k$. Otherwise, the 
learner updates $y_i$ to $h_i(x_1, \ldots, x_n, y_1, \ldots, y_k)$, and continues to attempt to make its 
classification. Note that since the concept is monotone, $y_i$ can only flip from ‘0’ 
to ‘1’, but not vice versa. Thus, the learner needs at most $k$ iterations for the 
output to stabilize before it outputs its prediction. 
If the learner makes a mistake, it will receive the correct labeling 
$(f_1(x), \ldots, f_k(x))$ of $x$. The learner updates its hypotheses by checking, for each $h_i$, whether 
$h_i(x_1, \ldots, x_n, f_1(x), \ldots, f_k(x)) = f_i(x)$. If not, then we feed 
$(x_1, \ldots, x_n, f_1(x), \ldots, f_k(x))$ as a counterexample to $A_i$, which will then update the hypothesis $h_i$. 
(Note that none of the $A_i$ will receive a wrong counterexample.) 

Suppose there is at least one hypothesis that makes a false positive mistake. 
Then among all such hypotheses, the one that makes the false positive 
prediction in the earliest iteration will surely be updated. This is because the target 
functions are monotone, and the other false positive mistakes do not affect this 
particular false positive prediction (since they are made in later iterations). 
Similarly, if the mistakes are all false negative mistakes then, due to monotonicity, 
all the hypotheses that make mistakes will get updated. Note that the mistake 
bound in Theorem 2 can be lowered further by repeatedly classifying $x$ and 
performing the above update, until all the predictions $h_1(x), \ldots, h_k(x)$ are 
correct. □ 
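
The construction in this proof can be sketched in code. The following is a minimal illustration only, not the paper's implementation: it assumes the base algorithm $A$ is the classical elimination learner for monotone disjunctions, and the class names, the toy targets `true_labels`, and the driver `run` are all hypothetical choices made for this example.

```python
from itertools import product


class MonotoneDisjunctionLearner:
    """Elimination learner for a monotone disjunction (assumed stand-in for A)."""

    def __init__(self, variables):
        self.variables = set(variables)  # hypothesis = OR of these positions

    def predict(self, z):
        return int(any(z[j] for j in self.variables))

    def update(self, z, label):
        # With valid counterexamples this learner only errs with false
        # positives, so it drops every variable that is switched on in z.
        if label == 0:
            self.variables -= {j for j in self.variables if z[j]}


class MonotoneICLLearner:
    """Runs one copy A_i per concept f_i, as in the proof of Theorem 2."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        # A_i sees the attributes X and every intermediate value except f_i.
        self.copies = [
            MonotoneDisjunctionLearner(j for j in range(n + k) if j != n + i)
            for i in range(k)
        ]

    def predict(self, x):
        y = [0] * self.k  # assumed values y_1..y_k, initially all 0
        for _ in range(self.k + 1):
            z = list(x) + y
            new_y = [a.predict(z) for a in self.copies]
            if new_y == y:  # every h_i agrees with its assumed value
                break
            y = new_y  # by monotonicity, bits only flip from 0 to 1
        return y

    def update(self, x, labels):
        z = list(x) + list(labels)  # counterexample uses the true f_j(x)
        for a, label in zip(self.copies, labels):
            if a.predict(z) != label:
                a.update(z, label)


def true_labels(x):
    """Toy targets (assumed): f_1 = x_1 v x_2, f_2 = f_1 v x_3."""
    f1 = x[0] | x[1]
    f2 = f1 | x[2]
    return [f1, f2]


def run(n=3, k=2, max_passes=50):
    learner = MonotoneICLLearner(n, k)
    mistakes = 0
    for _ in range(max_passes):
        clean = True
        for x in product((0, 1), repeat=n):
            labels = true_labels(x)
            if learner.predict(x) != labels:
                mistakes += 1  # one mistake per trial, as in the ICL model
                clean = False
                learner.update(x, labels)
        if clean:
            break
    return learner, mistakes
```

In this sketch each copy works over $n + k - 1$ Boolean inputs, so the elimination learner has $M \le n + k - 1$ and the number of mistaken trials stays within $kM$.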

Unfortunately, the proof does not apply when the hypothesis class is not 
monotone. For non-monotone hypothesis classes, we have the following corre- 
sponding theorem where the mistake bound also depends on the depth of the 
dependency DAG. 

Theorem 3. Suppose there is an online algorithm $A$ that learns a single target 
concept from some concept class $C$ with mistake bound $M$. Further, suppose 
the depth of the dependency DAG is $d$. Then there is an algorithm that learns 
multiple concepts $\mathcal{F} = \{f_1, \ldots, f_k\} \subseteq C$ with the number of mistakes bounded by 
$k^2 dM$. Note that we do not require the hypothesis class and the concept class to 
be the same. 

Proof: The learner runs one copy of $A$, call it $A_i$, to learn each function $f_i$. 
(See Figure 1.) Initially, $A_i$ makes an initial guess $l_i$ of the actual level $l(f_i)$ to 
be 1. That is, the functions only depend on the variables $X$. When the number 
of mistakes made by $A_i$ exceeds $M$, we rerun $A_i$, setting the mistake count 
to 0 and incrementing $l_i$ by one at the same time. 

Suppose a mistake is made on $f_i(x)$. If we predict all those values $f_j(x)$ with 
$l_j < l_i$ correctly, then we feed $x$ as a counterexample to $A_i$ so that its hypothesis 
can be updated. Otherwise, we do not update the hypothesis maintained by $A_i$. 
Intuitively, we are charging the cost of the mistake made to some function at 
a lower level. Further, if $f_i$ is at level $l$ and there is another function $f_j$ whose level 
is incremented from $l - 1$ to $l$, then $f_j$ may be irrelevant to $f_i$. The hypothesis 
$h_i$ constructed by $A_i$ may contain $f_j$ as a relevant variable and we will have 
to discard $h_i$ and rerun $A_i$. This is because we may not be able to evaluate $f_i(x)$ 
without knowing $f_j(x)$, which is currently at the same level. 

Our estimate of $l(f_i)$ will not exceed the actual $l(f_i)$. This can be proven 
easily by induction on $l(f_i)$. This is clearly true if $l(f_i) = 1$, in which case 
we are simply learning a concept with domain $X$. For $l(f_i) > 1$, we simply 
note that once $l_i$ reaches $l(f_i)$, all the estimates of the levels of the relevant 
intermediate concepts of $f_i$ are lower than $l_i$. Hence, at this stage, the values of 
initialize: 
    for i ∈ {1, ..., k} 
        l_i := 1 
        run a copy A_i of A 
        initialize the hypothesis h_i in A_i 
        set mistake count for h_i, m_i := 0 

predict x = (x_1, ..., x_n): 
    relevant_0 := X 
    for l := 1 to d 
        relevant_l := relevant_{l-1} 
        for each h_i such that l_i = l 
            predict f_i(x) = h_i(relevant_{l-1}) 
            relevant_l := relevant_l ∪ {f_i(x)} 

update: 
    find the smallest l s.t. ∃ h_i where l = l_i and f_i(x) ≠ h_i(relevant_{l-1}) 
    for all h_i where l = l_i and f_i(x) ≠ h_i(relevant_{l-1}) do 
        feed (x_1, ..., x_n, f_1(x), ..., f_k(x)) as counterexample to A_i 
        m_i := m_i + 1 
        if m_i > M 
            restart A_i with m_i := 0 
            l_i := l_i + 1 
            for all h_j s.t. l_j = l + 1, restart A_j with m_j := 0 



Fig. 1. Transforming a single concept learning algorithm to one that learns multiple 
intermediate concepts 



these relevant intermediate concepts in a counterexample fed to $A_i$ are correct. 
Thus, the counterexamples are all valid and we make at most $M$ mistakes. The 
number of mistakes made by $A_i$ when the estimate of $l(f_i)$ remains at some value 
smaller than $l(f_i)$ is at most $kM$. Thus, the total number of counterexamples 
needed to learn a single concept is $kdM$. Therefore, the total number of mistakes 
made is bounded by $k^2 dM$. □ 
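
As a reading aid, the counting in this proof can be laid out factor by factor. The numeric values below are illustrative assumptions and do not come from the paper.

```latex
\underbrace{k}_{\text{concepts } f_i}
\;\times\;
\underbrace{d}_{\text{values the estimate } l_i \text{ can take}}
\;\times\;
\underbrace{kM}_{\text{counterexamples per value of } l_i}
\;=\; k^2 d M,
\qquad
\text{e.g. } k = 5,\ d = 3,\ M = 20
\;\Rightarrow\; 5^2 \cdot 3 \cdot 20 = 1500 .
```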

Clearly, the algorithm in Theorem 3 is not very efficient, as we need to rerun 
$A_i$ each time a function moves to the same level as $f_i$. We show in the following 
theorem that for the class of decision lists, this is not necessary. A decision 
list is a linearly ordered sequence of nodes $((l_1, a_1), \ldots, (l_s, a_s))$ where the $l_i$'s are 
literals and $a_i \in \{+, -\}$. To predict the classification of an unlabeled instance $x$, 
we traverse the list from the front until we reach a literal $l_i$ which evaluates 
to true, and predict the label of $x$ as $a_i$. If none of the literals in the list 
evaluates to true, then $x$ is labeled using some default label. 
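
The evaluation rule just described can be sketched as follows. This is an illustration only; the encoding of a literal as a hypothetical `(index, polarity)` pair and the function name are assumptions made for the example.

```python
def evaluate_decision_list(nodes, x, default="-"):
    """Return the label of the first node whose literal is true on x.

    nodes:   sequence of ((index, polarity), label) pairs; the literal
             (index, polarity) is true on x when x[index] == polarity.
    default: label used when no literal in the list is true.
    """
    for (index, polarity), label in nodes:
        if x[index] == polarity:
            return label
    return default
```

For example, with `nodes = [((0, True), "+"), ((2, False), "-")]` the instance `(False, True, False)` falls through the first node and is labeled by the second.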

Theorem 4. The class of decision lists can be learned exactly in the ICL setting 
by making at most 

$$dk \cdot 2(n + k)^2$$

equivalence queries. Here, $|f_i|$ is the length of the decision list $f_i$, and $d$ is the 
depth of the dependency DAG. 

Proof: We begin by reviewing Rivest’s algorithm for learning a single 
decision list [Riv87] with $\mu$ variables. Rivest’s algorithm maintains an ordered 
collection of decision sublists $h = (C_1, \ldots, C_\mu, \text{false})$. The actual hypothesis is 
obtained by concatenating the sublists into a decision list. Initially, $C_1$ is a list 
that contains all the $2\mu$ possible nodes while the rest of the sublists are empty 
(i.e. false). When a prediction mistake is made by a node, say in sublist $C_i$, then 
the node is moved to the end of the sublist $C_{i+1}$. It is trivial to show that if the 
target decision list is of length $l$ then the number of mistakes made is at most 
$l \cdot 2\mu$. Note that since the learner does not know $l$, it can only be sure that 
the learning is complete after making $2\mu^2$ mistakes. 
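
The sublist scheme reviewed above can be sketched as below. This is a hedged illustration rather than Rivest's published algorithm: it assumes a node is a (literal, label) pair, it adds an always-true literal (written `None`) to play the role of the default label, and the names `SublistLearner`, `target` and `train` are hypothetical.

```python
from itertools import product


def all_nodes(mu):
    """Every (literal, label) pair. Literal (j, v) is true when x[j] == v;
    the literal None is always true (an assumed encoding of the default)."""
    literals = [(j, v) for j in range(mu) for v in (0, 1)] + [None]
    return [(lit, lab) for lit in literals for lab in (0, 1)]


def fires(literal, x):
    return literal is None or x[literal[0]] == literal[1]


class SublistLearner:
    def __init__(self, mu):
        # C_1 holds every node; later sublists are created when needed.
        self.sublists = [all_nodes(mu)]

    def first_firing(self, x):
        # Scan the concatenation C_1, C_2, ... for the first true literal.
        for level, sublist in enumerate(self.sublists):
            for node in sublist:
                if fires(node[0], x):
                    return level, node
        raise AssertionError("an always-true node is present in some sublist")

    def predict(self, x):
        return self.first_firing(x)[1][1]

    def update(self, x, label):
        """Move the node responsible for a mistake to the end of the next sublist."""
        level, node = self.first_firing(x)
        if node[1] == label:
            return False
        self.sublists[level].remove(node)
        if level + 1 == len(self.sublists):
            self.sublists.append([])
        self.sublists[level + 1].append(node)
        return True


def target(x):
    """Toy target (assumed): ((x_1 = 1, 1), (x_3 = 0, 0)) with default label 1."""
    if x[0] == 1:
        return 1
    if x[2] == 0:
        return 0
    return 1


def train(mu=3, max_passes=200):
    learner = SublistLearner(mu)
    mistakes = 0
    for _ in range(max_passes):
        clean = True
        for x in product((0, 1), repeat=mu):
            if learner.update(x, target(x)):
                mistakes += 1
                clean = False
        if clean:
            break
    return learner, mistakes
```

Each mistake moves exactly one node one sublist down, and a node of the target at position $j$ never needs to move past sublist $C_j$, which is the intuition behind the mistake bound quoted in the text.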

The proof is a very simple modification of the algorithm described in 
Theorem 3. We use Rivest’s decision list learning algorithm [Riv87] as the base 
learning algorithm in the proof of Theorem 3. However, we do not need to restart 
$A_i$ when there is a function $f_j$ that moves to the same level as $f_i$. Instead, we 
simply remove the four nodes with label $f_j$ or $\bar{f}_j$ from our hypothesis of $f_i$. 
This new hypothesis is the same one as if the learner had been told beforehand that 
$f_j$ does not affect $f_i$ and had encountered the sequence of trials obtained from the 
original sequence of trials by ignoring those trials in which the learner made a wrong 
prediction using one of these four nodes. This is because a node is moved from 
sublist $C_i$ to sublist $C_{i+1}$ in the original learning process if and only if it is moved 
from sublist $C_i$ to sublist $C_{i+1}$ in the latter scenario. 

As the learner does not need to restart $A_i$, the bound obtained in Theorem 3 
can be trimmed by a factor of $k$. □ 

We show in the following that for the concept class of conjunctions union 
disjunctions, we do not even need to restart $A_i$ whenever $f_i$ moves one level up. 
Clearly, if the class is simply disjunctions (or conjunctions), then we can treat 
each function in $\mathcal{F}$ as a disjunction (conjunction) and learn $\mathcal{F}$ individually. 
However, with disjunctions union conjunctions, the functions in $\mathcal{F}$ become Boolean 
circuits with and-or gates, which are difficult to learn individually. 

For each function $f_i$, either $f_i$ or $\bar{f}_i$ is a disjunction. We maintain a pair of 
disjunctions $h_i^+$ and $h_i^-$ as our hypotheses of $f_i$ and $\bar{f}_i$. That is, $h_i^+(x) = +$ 
signifies that we should classify $x$ as positive, while $h_i^-(x) = +$ means we should 
classify $x$ as negative. Initially, both $h_i^+$ and $h_i^-$ are disjunctions of all the literals 
formed by $X$ (but none of $\mathcal{F}$) and are assigned a level of 1. Each $h_i^+$ ($h_i^-$) 
maintains a list of forbidden functions, initially empty, that are the subset of $\mathcal{F}$ 
which we are sure do not appear in $f_i$. We predict $f_i(x)$ using the pair $(h_i^+, h_i^-)$ 
and update $(h_i^+, h_i^-)$ according to the following two cases. 

Case 1: $h_i^+(x) = -$ and $h_i^-(x) = -$. We predict according to the hypothesis 
that has the smallest mistake count. Regardless of whether our guess is 
correct, we keep incrementing the estimate of $l(f_i)$ (and hence the levels of 
$h_i^+$ and $h_i^-$) until there is some function $f_j$ at one level lower that is not in 
the forbidden list. We add these functions and their negations as literals 
to both $h_i^+$ and $h_i^-$. We call the literals formed by $\mathcal{F}$ or their negations 
functional literals. 

Case 2: Either $h_i^+(x) = +$ or $h_i^-(x) = +$. We make our prediction according 
to which of $h_i^+(x)$ and $h_i^-(x)$ is ‘+’. When both of them predict ‘+’, we 
pick the one with the smallest mistake count. Suppose we make a mistake. 
Say we predict ‘+’ (i.e. $h_i^+(x) = +$). In this case, we remove from $h_i^+$ all the 
literals, including functional literals, that agree with $(x, f_1(x), \ldots, f_k(x))$. We 
then add to the forbidden list of $h_i^+$ those functional literals that are set to 
true. 



Theorem 5. Suppose the target functions $\mathcal{F}$ are a set of $k$ conjunctions or 
disjunctions with the dependency graph having depth $d$. Then $\mathcal{F}$ can be learned 
exactly by making at most $2k(2(n + k) + d)$ mistakes or equivalence queries. 

Proof Sketch: Let us consider an arbitrary function $f_i$ in $\mathcal{F}$. Without loss of 
generality, assume that $f_i$ is a disjunction and not a conjunction (otherwise, the 
same argument holds if we assume that we are learning $\bar{f}_i$, which is a disjunction). 
We have the following straightforward facts. 

Fact 1: Any relevant attribute from $X$ that appears in $f_i$ will not be removed 
from $h_i^+$. 

Fact 2: An intermediate concept (or its negation) that appears in the forbidden 
list of $h_i^+$ definitely does not appear in $f_i$. 

Fact 3: When a Case 2 mistake is made, at least one literal in $X$ is eliminated 
or one functional literal is added to the forbidden list. 

Fact 4: The level of $(h_i^+, h_i^-)$, i.e., our estimate of $l(f_i)$, will not exceed the 
actual $l(f_i)$. This can be proved easily by induction on $l(f_i)$. This is clearly 
true if $l(f_i) = 1$, in which case we are simply learning a disjunction over the 
attributes $X$. For $l(f_i) > 1$, we simply note that once $l_i$ reaches $l(f_i)$, all the 
estimates of the levels of the relevant intermediate concepts of $f_i$ are lower 
than $l_i$ and hence will appear as literals in $h_i^+$. 

Fact 4 implies that in learning $f_i$, a Case 1 update is performed at most $l(f_i)$ 
times. Facts 1, 2 and 3 imply that after at most $2(n + k)$ Case 2 updates, $f_i$ is 
exactly determined. Thus, the number of mistakes made by predicting according 
to $h_i^+$ is at most $2(n + k) + d$. Further, once the number of mistakes made by 
predicting wrongly according to $h_i^-$ exceeds $2(n + k) + d$, the future predictions 
of $h_i^+$ precede those of $h_i^-$. Thus, the number of mistakes made in predicting 
$f_i(x)$ is at most $2k(2(n + k) + d)$. □ 
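
One way to read the bound of Theorem 5 is as a product of the counts named in the proof sketch. The numeric values are illustrative assumptions, not results from the paper.

```latex
\underbrace{k}_{\text{functions in } \mathcal{F}}
\;\times\;
\underbrace{2}_{h_i^+ \text{ and } h_i^-}
\;\times\;
\Bigl(\underbrace{2(n+k)}_{\text{Case 2 updates}} + \underbrace{d}_{\text{Case 1 updates}}\Bigr)
\;=\; 2k\bigl(2(n+k) + d\bigr),
\qquad
\text{e.g. } n = 10,\ k = 3,\ d = 2
\;\Rightarrow\; 2 \cdot 3 \cdot (26 + 2) = 168 .
```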

4 Exact Learning 

We show in Section 3 that learning results in the mistake bound model can be translated 
to our ICL model. Since an online algorithm with mistake bound $M$ is 
essentially equivalent to an exact learning algorithm using $M + 1$ (possibly improper) 
equivalence queries, we have the following corollary. 

Observation 6 The results presented in Section 3 hold for the intermediate 
concepts exact learning model with a (possibly improper) equivalence query 
oracle. 
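
The correspondence between a mistake bound $M$ and $M + 1$ equivalence queries can be illustrated with a small simulation. This is a sketch under stated assumptions: the online learner is the elimination algorithm for monotone disjunctions, the equivalence oracle is simulated by brute-force search over all instances, and every name below is hypothetical.

```python
from itertools import product


class EliminationLearner:
    """Online learner for monotone disjunctions; makes at most n mistakes."""

    def __init__(self, n):
        self.variables = set(range(n))

    def predict(self, x):
        return int(any(x[j] for j in self.variables))

    def update(self, x, label):
        # Only false positives occur: drop the variables that are on in x.
        if label == 0:
            self.variables -= {j for j in self.variables if x[j]}


def equivalence_oracle(hypothesis, target, n):
    """Return a counterexample, or None if hypothesis and target agree everywhere."""
    for x in product((0, 1), repeat=n):
        if hypothesis(x) != target(x):
            return x
    return None


def exact_learn(learner, target, n):
    """Pose the learner's current hypothesis as an equivalence query until
    the oracle answers yes; each counterexample is one online mistake."""
    queries = 0
    while True:
        queries += 1
        counterexample = equivalence_oracle(learner.predict, target, n)
        if counterexample is None:
            return queries
        learner.update(counterexample, target(counterexample))


def toy_target(x):
    """Toy target (assumed): x_1 v x_3 over n = 4 attributes."""
    return x[0] | x[2]
```

Every query except the last returns a counterexample that forces one mistake-driven update, so the number of queries is at most the mistake bound plus one.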



However, it is not clear whether efficient learning algorithms that make use of 
membership queries can be translated to efficient algorithms in the ICL setting. 
The difficulty here is that we have no control over the values of the intermediate 
target functions on an instance. That is, the values of the attributes $X$ of an instance 
completely determine the values of $\mathcal{F}$. This prevents the learner from posing an 
arbitrary membership query on an instance where not only all the attribute values 
are specified, but all the function values, except one, are also specified. This is 
because the specified function values may not be the same as the actual function 
values determined by the attributes. 

Even for the fundamental algorithm of Angluin for learning monotone DNFs, 
it is not clear how we can translate it to learn multiple monotone DNFs. One 
immediate observation is that each function in $\mathcal{F}$ can be expressed as a 
monotone DNF over $X$ only. It seems that we can simply learn each function as 
a monotone DNF over $X$ without regard to the intermediate concepts. 
However, the representation size may be exponential in the size of the 
representation using both $X$ and $\mathcal{F}$. For example, consider $|X| = (k - 1)m$. Suppose 
$\forall i \in \{1, \ldots, k - 1\}$, $f_i = x_{i1} \vee \cdots \vee x_{im}$ and $f_k = f_1 \wedge \cdots \wedge f_{k-1}$; then the 
DNF representation of $f_k$ using only $X$ has $m^{k-1}$ terms! 
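
The blow-up in this example can be checked by expanding the conjunction of disjunctions directly. The sketch below is illustrative; the function names and the encoding of the variable $x_{ij}$ as the pair `(i, j)` are assumptions.

```python
from itertools import product


def example_clauses(k, m):
    """f_i = x_{i1} v ... v x_{im} for i = 1..k-1; variable x_{ij} is (i, j)."""
    return [[(i, j) for j in range(1, m + 1)] for i in range(1, k)]


def expand_to_dnf(clauses):
    """Distribute an AND of OR-clauses into its set of DNF terms by picking
    one variable from each clause."""
    return {frozenset(choice) for choice in product(*clauses)}
```

Because the clauses share no variables, every choice of one variable per clause gives a distinct term, so $f_k$ expands into $m^{k-1}$ terms over $X$ while its representation over $X$ and $\mathcal{F}$ has only $k - 1$ literals.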

In the presence of equivalence and membership query oracles, unate DNFs 
(where each attribute does not appear as both a positive literal and a negative literal 
in the target DNF) [AHK93], Horn DNFs (where each term can have at most 
one negated literal) [AFP92], read-once formulas (where each variable appears 
exactly once) [AHK93] and ordered binary decision diagrams (decision DAGs, 
a.k.a. branching programs, such that the labels along any directed path respect a 
linear ordering of the attributes) [GG95] have been shown to be learnable in the 
single-task learning setting. However, the following Theorems 7, 8 and 9 show that 
efficient learnability of these classes in the intermediate concepts setting would 
imply efficient learnability of DNFs. The latter is one of the more challenging 
problems in learning theory. 



Theorem 7. Efficient learnability of unate DNFs and Horn DNFs, using 
equivalence and/or membership queries in the intermediate concepts setting, would 
imply efficient learnability of DNFs using equivalence and membership queries 
in the single concept setting. 

Proof: Suppose unate DNFs are learnable in the intermediate concepts 
learning setting with membership and equivalence queries. Then we can use the unate 
DNF learning algorithm to learn a DNF Boolean formula $f$, by introducing 
intermediate concepts $f_{2i-1}(x) = x_i$ and $f_{2i}(x) = \bar{x}_i$ for each attribute $x_i$. Any 
Boolean DNF can then be expressed as a monotone DNF using these 
intermediate concepts. Since we know what these intermediate concepts are, it is clear 
that we can simulate the equivalence and membership query oracles for the 
intermediate concepts learning problem by using the standard equivalence and 
membership query oracles for $f$. The same argument also holds for Horn DNFs. 

□ 

Theorem 8. Efficient learnability of read-twice DNFs, read-once formulas and
ordered binary branching programs using equivalence and/or membership queries
in the ICL setting would imply efficient learnability of DNFs using equivalence
and membership queries in the standard single concept setting.

Proof: Without loss of generality, suppose we know $k$, the maximum number
of times the same literal appears in the target read-twice DNF in the single-task
learning problem. Then we simply introduce $k$ intermediate concepts that are
equivalent to $x_i$ and $k$ intermediate concepts that are equivalent to $\bar{x}_i$. Clearly,
any boolean DNF formula can be expressed as a read-once monotone DNF
formula with these intermediate concepts, and hence as a read-once formula and
read-twice DNF over these intermediate concepts. A similar argument can be used
to show that learning ordered binary decision diagrams is as hard as learning
unordered binary decision diagrams (a.k.a. branching programs). The latter concept
class can be used to represent any boolean formula [HTW96, Bar89] of at most the same size.
Thus, learning ordered binary decision diagrams is also hard. □

We have the following hardness result for learning monotone DNFs. At first
sight, the following proof seems to imply that an efficient algorithm for learning
monotone DNFs using EQs and MQs in the standard exact model can be
transformed into an algorithm for learning general DNFs. However, this is not the case,
because in that setting the MQ oracle cannot be properly simulated, unlike in
the transformation presented below.

Theorem 9. Efficient learnability of monotone DNFs in the intermediate
concepts mistake bound learning model with membership queries would imply
efficient learnability of DNFs in the standard (single concept) mistake bound model
using membership queries.

Proof: As before, we reduce the problem of learning a single general DNF
formula to learning multiple monotone formulas. We introduce one new variable
$y_i$ for each negated literal $\bar{x}_i$. Clearly, the single target (general) DNF
$f$ would appear as a monotone DNF $f'$ with the original set of variables
and the new variables. An instance $x = (x_1, \ldots, x_n)$ will be transformed to
$\hat{x} = (x_1, \ldots, x_n, y_1, \ldots, y_n)$ where $y_i = \bar{x}_i$. For each pair of variables
$\{x_i, y_i\}$, we introduce the intermediate concept $f_i(\hat{x}) = x_i \vee y_i$. We also
introduce another intermediate function
$\hat{f}(\hat{x}) = x_1 y_1 \vee \cdots \vee x_n y_n \vee (f_1(\hat{x}) \wedge \cdots \wedge f_n(\hat{x}) \wedge f'(\hat{x}))$.

We claim that $\hat{f}(\hat{x}) = f(x)$. Clearly, $f_i(\hat{x}) = +$ for each $1 \le i \le n$, since
in $\hat{x}$ exactly one of $x_i$ and $y_i$ ($= \bar{x}_i$) is equal to 1. It is also easy to verify that
$\hat{f}(\hat{x}) = f'(\hat{x}) = f(x)$. Therefore, to learn the single DNF formula $f$, we can
simply assume that we are learning $\hat{f}$ in the transformed instance space with
the intermediate concepts $f_1, \ldots, f_n$.

^ The following proof is supplied by an anonymous referee.

Next, since the $f_i$'s are known fixed functions, we can answer membership
queries on $f_i(\hat{x})$. It suffices to show that the membership query oracle $MQ_f$ for
$f$ can be used to simulate the membership query oracle $MQ_{\hat{f}}$ for $\hat{f}$. On input $\hat{x}$,
$MQ_{\hat{f}}$ first checks whether there exists an $i$ such that $x_i = y_i = 1$. If so, it returns
‘+’. Otherwise, $\hat{f}$ is reduced to $f_1 \wedge \cdots \wedge f_n \wedge f'$. It proceeds to check whether
there is an $i$ such that $f_i(\hat{x}) = x_i \vee y_i = 0$. If so, it returns ‘-’. Otherwise, $\hat{f}$ is
reduced to $f'$. Now, if $MQ_{\hat{f}}$ reaches this stage then for each $i$, we do not have
$x_i = y_i = 1$ but we have that either $x_i$ or $y_i$ is 1. In other words, we have $y_i = \bar{x}_i$ and
we can map (inversely) $\hat{x}$ to $x = (x_1, \ldots, x_n)$ so that $x$ is transformed to $\hat{x}$. Now, to determine
$\hat{f}(\hat{x})$ we simply ask $MQ_f$ for $f(x)$, which is equal to $f'(\hat{x}) = \hat{f}(\hat{x})$ as desired. □
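The oracle simulation in this proof can be sketched in a few lines. This is an illustrative sketch only: the name `make_simulated_mq`, the tuple encoding of a transformed instance as $(x_1, \ldots, x_n, y_1, \ldots, y_n)$, and the example target are assumptions introduced here, not notation from the paper.

```python
from typing import Callable, Sequence

def make_simulated_mq(mq_f: Callable[[Sequence[int]], bool], n: int):
    """Build a membership oracle for the transformed target from one for f.

    A transformed instance is (x_1..x_n, y_1..y_n). The transformed target is
    x_1 y_1 v ... v x_n y_n v (f_1 ^ ... ^ f_n ^ f') with f_i = x_i v y_i.
    """
    def mq_transformed(z: Sequence[int]) -> bool:
        xs, ys = z[:n], z[n:]
        # Some pair has x_i = y_i = 1, so the term x_i y_i is satisfied.
        if any(x == 1 and y == 1 for x, y in zip(xs, ys)):
            return True
        # Some f_i = x_i v y_i is 0, so the conjunction f_1 ^ ... ^ f_n ^ f' is 0.
        if any(x == 0 and y == 0 for x, y in zip(xs, ys)):
            return False
        # Now y_i is the complement of x_i for every i: ask the original oracle.
        return mq_f(xs)
    return mq_transformed

# Hypothetical target f = x1 ~x2 v x3, used only to exercise the oracle.
f = lambda x: bool((x[0] and not x[1]) or x[2])
mq = make_simulated_mq(f, n=3)
print(mq((1, 0, 0, 0, 1, 1)))  # y is the complement of x, so this is f(1, 0, 0)
```

The first two branches never consult the original oracle, which is why queries on instances that do not correspond to any original instance can still be answered.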

In applications of intermediate concepts learning, we would typically expect
the number of target concepts to be much less than the number of attributes.
The next natural question to ask is what concept classes can be learned if there is
a bound on the size of $\mathcal{F}$. When the size of $\mathcal{F}$ is some small constant $k$ and the
concepts in $\mathcal{F}$ are monotone DNF formulas, then clearly all the concepts can be
represented as monotone DNFs over $X$ with a polynomial (but exponential in $|\mathcal{F}|$) blowup
in representation size. Thus, we have the following result.

Observation 10. Suppose $|\mathcal{F}|$ is a constant. Then monotone DNFs can be
efficiently exactly learned in the intermediate concepts learning setting using
equivalence and membership queries. □



5 PAC Learning Results 

Recall that the VC-dimension of a concept class C, denoted by VC-dim (C), is 
the size of the largest sample that can be labeled in all possible ways by concepts 
in C. The following theorem due to Blumer et al. [BEHW89] gives a sufficient
condition for learning a single target concept in the PAC model. 

Theorem 11. [BEHW89] Let $C$ be an arbitrary concept class with finite
VC-dimension. Suppose also that there is an algorithm $A$ such that for all possible
choices of target concept $f \in C$, given any sample $S$ of examples labeled according
to $f$, with size of $S$ at least

$$\max\left( \frac{2}{\epsilon} \log \frac{2}{\delta},\ \frac{8\,\text{VC-dim}(C)}{\epsilon} \log \frac{13}{\epsilon} \right),$$

$A$ outputs a hypothesis in $C$ that classifies $S$ the same way as $f$. Then $A$ learns
$C$ in the PAC model.

In the above theorem, the consistency problem that $A$ is solving can be
generalized naturally to the intermediate concept learning setting. In our learning
environment, each example $x$ in the sample $S$ is labeled according to $k$
concepts $f_1, \ldots, f_k$. These concepts are from a given concept class $C$ and satisfy the
dependency relation defined by some dependency DAG $G$. An immediate (but
not quite ‘correct’) objective is to output a set of hypotheses $H = \{h_1, \ldots, h_k\}$,
each in $C$, such that for each $x$ in $S$ and for all $i$,
$h_i(x, f_1(x), \ldots, f_{i-1}(x), f_{i+1}(x), \ldots, f_k(x)) = f_i(x, f_1(x), \ldots, f_{i-1}(x), f_{i+1}(x), \ldots, f_k(x))$.
Note that this is somewhat well-defined
for $S$ since the instances in $S$ are labeled. However, for an unlabeled instance
$x$, the learner does not know the values of the $f_i(x)$'s. Therefore, in order to use
$H$ in making predictions, we require $H$ to satisfy a dependency DAG that has
depth smaller than that of $G$. Note that by our assumption of the PAC learning
process, we know that such an $H$ exists. Under this dependency DAG constraint,
the number of parameters that each $f_i$ is dependent on is smaller than $n + k$.
Hence, the VC-dimension of $f_i$ is smaller than that of $C$ defined over $X \cup \mathcal{F}$.

We have the following extension of Theorem 11 in the intermediate concepts
learning environment.

Corollary 1. Suppose a concept class $C_{n,k}$ has VC-dimension polynomial in $n$
and $k$, where $n$ is the number of attributes in $X$ and $k$ is the number of
intermediate concepts (in $\mathcal{F}$). Let $S$ be an arbitrary sample of $k$-labeled examples labeled
according to an ordered set $\mathcal{F} = \{f_1, \ldots, f_k\}$, with size of $S$ at least

$$\max\left( \frac{2k}{\epsilon} \log \frac{2k}{\delta},\ \frac{8k\,\text{VC-dim}(C_{n,k})}{\epsilon} \log \frac{13k}{\epsilon} \right).$$

If there is an algorithm $A$ such that for all possible choices of target concept $f \in
C$, given any such sample $S$, $A$ outputs an ordered set of hypotheses $\{h_1, \ldots, h_k\}$ in
$C$ that classifies $S$ the same way as $\mathcal{F}$, then $A$ PAC learns $C$ in the intermediate
concepts learning setting.

Proof: Let $D$ be the distribution from which we draw the sample $S$. Let $h_i$ be
a hypothesis for $f_i$ that is consistent with $S$ and satisfies the dependency DAG
constraint. By Theorem 11, with probability at least $1 - \delta/k$, the probability
of $h_i(x) \neq f_i(x)$ for an instance $x$ drawn from $D$ is at most $\epsilon/k$. Therefore,
with probability at least $1 - \delta$, the probability that there exists $i$ such that
$h_i(x) \neq f_i(x)$ for an instance $x$ drawn from $D$ is at most $\epsilon$. □
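Written out, the union-bound step behind this proof looks as follows. This is a sketch under the assumption that Theorem 11 is applied to each $f_i$ with accuracy $\epsilon/k$ and confidence $\delta/k$, which is what the factors of $k$ in the sample size of Corollary 1 suggest.

```latex
\begin{align*}
\Pr_{S}\Bigl[\exists i:\ \Pr_{x\sim D}[h_i(x)\neq f_i(x)] > \tfrac{\epsilon}{k}\Bigr]
  &\le \sum_{i=1}^{k} \Pr_{S}\Bigl[\Pr_{x\sim D}[h_i(x)\neq f_i(x)] > \tfrac{\epsilon}{k}\Bigr]
   \le k\cdot\frac{\delta}{k} = \delta, \\
\Pr_{x\sim D}\bigl[\exists i:\ h_i(x)\neq f_i(x)\bigr]
  &\le \sum_{i=1}^{k} \Pr_{x\sim D}[h_i(x)\neq f_i(x)]
   \le k\cdot\frac{\epsilon}{k} = \epsilon .
\end{align*}
```

The second line holds on the event, of probability at least $1-\delta$ by the first line, that every $h_i$ has error at most $\epsilon/k$.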

Hence, to PAC-learn a concept class $C$ in the intermediate concepts learning
setting, it suffices to be able to solve our version of the consistency problem. The
following theorem states that an efficient algorithm for solving the single-task
consistency problem can be converted into one that solves the intermediate
concepts consistency problem.

Theorem 12. If there is an efficient algorithm $A$ for solving the consistency
problem for a concept class $C$ in the single task setting then there is an efficient
algorithm $A'$ for solving the consistency problem for the concept class $C$ in the
ICL setting.

Proof: The algorithm has the same flavor as the algorithms presented in the
earlier sections. As before, we start with an initial guess $l_i$ of the level of
each $f_i$ as 1. We then run $A$ to find a hypothesis $h_i$ that labels the sample
in the same way as $f_i$. If we cannot find such a hypothesis then we increment $l_i$
by 1. We then repeat the same process of finding a consistent $h_i$ where the
attributes being considered are those in $X$ plus those intermediate concepts whose
levels are smaller than $l_i$.

In the first iteration, we can output a consistent hypothesis for each of the
concepts with actual level 1. As before, it is easy to verify by induction that
we never overestimate the level of $f_i$. This ensures that in the subsequent $i$th
iteration, the set of concepts for which a consistent hypothesis has been constructed
contains all those concepts of level less than $i$. This guarantees that the relevant
functions that we consider in the $i$th iteration include the latter concepts.
Therefore, we can construct a consistent hypothesis for each concept of level $i$. □
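The level-by-level search in this proof can be illustrated on a toy class. The sketch below is an assumption-laden illustration, not the paper's algorithm verbatim: the single-task solver `solve_2conj` (monotone conjunctions of at most two attributes, found by brute force) stands in for the generic algorithm $A$, and the names `icl_consistency`, `level` and `hyps` are hypothetical.

```python
from itertools import combinations, product

def solve_2conj(rows, labels):
    """Toy single-task consistency solver (stand-in for A): return the indices
    of a monotone conjunction of at most 2 attributes that agrees with every
    label, or None if no such conjunction exists."""
    n = len(rows[0])
    for size in (1, 2):
        for idx in combinations(range(n), size):
            if all(int(all(r[j] for j in idx)) == y for r, y in zip(rows, labels)):
                return idx
    return None

def icl_consistency(X, Y, solver):
    """Concept i may use X plus the already-solved concepts, all of which have
    a level below the current guess for i."""
    k = len(Y[0])
    level = [1] * k
    hyps = {}
    for t in range(1, k + 1):
        avail = sorted(hyps)  # concepts solved in earlier rounds (level < t)
        rows = [tuple(x) + tuple(y[j] for j in avail) for x, y in zip(X, Y)]
        for i in range(k):
            if i in hyps:
                continue
            h = solver(rows, [y[i] for y in Y])
            if h is None:
                level[i] = t + 1      # guess was too low, raise it
            else:
                hyps[i] = (avail, h)  # h indexes into X followed by avail
        if len(hyps) == k:
            break
    return level, hyps

X = list(product((0, 1), repeat=3))
Y = [(x[0] & x[1], x[0] & x[1] & x[2]) for x in X]  # f1 = x0 x1, f2 = f1 x2
level, hyps = icl_consistency(X, Y, solve_2conj)
print(level, hyps)
```

In this example the second concept is not a conjunction of two base attributes, so the solver fails in the first round and succeeds in the second, once the values of the first concept are available as an extra attribute.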

6 Future Work 

In this paper, we show that efficient single-task learning algorithms without 
membership queries can be extended to efficient multi-task learning in most 
commonly studied models. However, in the presence of a membership query 
oracle, this may not be true. In particular, Theorems 7, 8 and 9 suggest that
most concept classes are difficult to learn in the intermediate concepts learning 
setting with equivalence and membership queries. 

Problem 1. An interesting question is whether one can give a characterization 
of these concept classes that are difficult to learn in the intermediate concepts 
learning setting with membership queries. Is there any concept class that can be 
learned in the ICL model with equivalence and membership queries? 

In most situations, we would expect the number of target concepts to be
much smaller than the number of attributes. However, the only positive result
obtained when $|\mathcal{F}|$ is bounded is for the case where $|\mathcal{F}|$ is constant and the
concept class is monotone DNFs (see Observation 10).

Problem 2. Is there any concept class that can be learned when $|\mathcal{F}|$ is $O(\log n)$?

Most results [Lit88, CBLW95, KW94, HKW96] obtained in the mistake bound
model are relative to the best hypothesis. That is, the number of mistakes made
by the best hypothesis appears as an additive term. For example, Littlestone’s
Winnow algorithm makes at most $O(k \log n + M_{opt})$ mistakes in learning
$k$-disjunctions. Here, $n$ is the number of variables and $M_{opt}$ is the number of
mistakes made by the best $k$-disjunction. However, in transforming these efficient
online algorithms for learning a single concept to the intermediate concept learning
setting (as in the earlier section on the mistake bound model), the number of
mistakes made by the best hypothesis appears as a multiplicative factor.

Problem 3. Can any concept class be learned where the total mistakes made 
by the best ordered collection of hypotheses appears as an additive term in the 
mistake bound? 
In this preliminary research on intermediate concepts learning, we only
examine the three most popular learning models. A natural direction for future
work is to investigate such learning in other learning models.

Problem 4. Can efficient single-task learning algorithms in variations of
the PAC and exact models be translated to efficient intermediate concepts learning
algorithms in the corresponding models? The exact learning model with various types
of ‘imperfect’ membership query oracles [AS91, AK94, AKST97, BCGS95], the
agnostic PAC model [KSS94] and the statistical PAC model [Kea93] are some examples
of interesting models to consider here.
Abstract. The multiple-instance model was motivated by the drug activity
prediction problem where each example is a possible configuration
for a molecule and each bag contains all likely configurations for
the molecule. While there has been a significant amount of theoretical 
and empirical research directed towards this problem, most research per- 
formed under the multiple-instance model is for concept learning. How- 
ever, binding affinity between molecules and receptors is quantitative 
and hence a real-valued classification is preferable. 

In this paper we initiate a theoretical study of real-valued multiple in- 
stance learning. We prove that the problem of finding a target point 
consistent with a set of labeled multiple-instance examples (or bags) 
is NP-complete. We also prove that the problem of learning from real- 
valued multiple-instance examples is as hard as learning DNF. Another 
contribution of our work is in defining and studying a multiple-instance 
membership query (MI-MQ). We give a positive result on exactly learn- 
ing the target point for a multiple-instance problem in which the learner 
is provided with a MI-MQ oracle and a single adversarially selected bag. 



1 Introduction 

The multiple-instance learning model is becoming increasingly important within 
machine learning. Unlike standard supervised learning in which each instance 
is labeled in the training data, in this model each example is a set (or bag) of 
instances which is labeled as to whether any single instance within the bag is 
positive. The individual instances are not given a label. The goal of the learner 
is to generate a hypothesis to accurately predict the label of previously unseen 
bags. 

The multiple-instance model was motivated by the drug activity prediction 
problem where each example is a possible configuration (or shape) for a molecule 
of interest and each bag contains all low-energy (and hence likely) configurations 
for the molecule [B]. There has been a significant amount of theoretical and 
empirical research directed towards this problem. Other applications for the 
multiple-instance model have been studied. For example, Maron and Ratan [12]
applied the multiple-instance model to the task of learning to recognize a person 
from a series of images that are labeled positive if they contain the person and 
negative otherwise. They have also applied this model to learn descriptions of 
natural images (such as a waterfall) and then used the learned concept to retrieve 
similar images from a large image database. More recently, Ruffo m has used 
this model for data mining applications. 

Most prior research performed under the multiple-instance model is for
concept learning (i.e. boolean labels). The first empirical study of Dietterich et al. [6]
used real data for the problem of predicting whether or not a synthetic molecule
binds to the musk receptor. However, binding affinity between molecules and
receptors is quantitative, borne out in quantities such as the energy released by
the molecule-receptor pair upon binding, and hence a real-valued classification of
binding strength in these situations is preferable. Dietterich et al. say “The only
aspect of the musk problem that is substantially different from typical pharma- 
ceutical problems is that the musk strength is measured qualitatively by expert 
human judges, whereas drug activity binding is usually measured quantitatively 
through biochemical assays.” 

Furthermore, the previous work has just considered learning from a given set 
of labeled bags. However, in the real drug-discovery application, obtaining the 
label for a bag (which corresponds to making the drug and then running a labo- 
ratory experiment) is very time consuming. Thus the most appropriate learning 
model is that of on-line learning with membership queries. More specifically, you 
begin with a random labeled example. Then a new drug is selected and created 
followed by an experiment to obtain its affinity value (i.e. the label), and so on. 
Selecting the next drug to test is very much like a membership query (which 
outputs a real-valued label) except one cannot select an arbitrary set of points
to define a bag. 

Our goal here is to initiate a theoretical study on real-valued multiple in- 
stance learning which includes the introduction of a multiple-instance member- 
ship query (MI-MQ). We prove that the problem of finding a target point consis- 
tent with a set of labeled multiple-instance examples (or bags) is NP-complete. 
We also prove that the problem of learning from real-valued multiple-instance 
data is as hard as learning DNF. A key contribution of this paper is a positive 
result on exactly learning the target point for a multiple-instance problem in 
which the learner is provided with a MI-MQ and a single adversarially selected 
bag $b = \{p_1, \ldots, p_r\}$.

2 The Real-Valued Multiple-Instance Model 

Consider the standard PAC learning problem of learning an axis-aligned box in
$\mathbb{R}^n$. In this model each labeled example is a point in $\mathbb{R}^n$ (drawn according to some
unknown distribution $\mathcal{D}$) and labeled as positive if and only if it is in the target
box. A boolean multiple-instance example is a collection of $r$ points in $\mathbb{R}^n$ (often
called a bag or $r$-example) which is labeled as positive if and only if at least one
of the points in the bag is in the target box. For the drug discovery application,
each bag corresponds to a drug and each point in the bag corresponds to one of the
shapes that it is likely to take.

We extend this model to the real-valued setting in the following manner. We
assume that there is a target point $t$ in $\mathbb{R}^n$ which corresponds to the ideal shape.
The label for bag $b = \{p_1, \ldots, p_r\}$ is $\max_{1 \le i \le r} V(dist(t, p_i))$, where $dist(t, p_i)$ is
the distance between $t$ and $p_i$ in the $L_2$ norm and $V$ is some function that
relates distance with binding strength. For example, $V$ could be defined by the
widely used empirical potential for intermolecular interactions, the Lennard-Jones
potential $V(d) = 4\epsilon\left[(\sigma/d)^{12} - (\sigma/d)^{6}\right]$, where $\epsilon$ is the depth of the potential
well, $\sigma$ is the distance at which $V(d) = 0$, and $d$ is the internuclear distance for
two monoatomic moleculesjl]. The Lennard-Jones model is nice because of its 
mathematical simplicity and ability to qualitatively mimic the real interaction 
between molecules. For the purposes of this paper, the only property we assume 
about the computation of the binding strength between p and q is that from 
it dist{p, q) can be obtained and that the binding strength diminishes as the 
distance to the target increases. An alternate definition for the label of $b =
\{p_1, \ldots, p_r\}$ is to compute

$$d_{min}(b) = \min_{1 \le i \le r} dist(t, p_i)$$

and then return $V(d_{min}(b))$ as the label. We will use this view and further
assume that $d_{min}$ itself is given. In general, one can extend this model by using
a weighted $L_2$ norm but in this work we assume that an unweighted $L_2$ norm is
used.

We now define a multiple-instance membership oracle (MI-MQ). There are 
several reasons why we do not want to allow as input to our membership oracle 
an arbitrary bag (even of some fixed size $r$). First, if allowed to do this then by
perturbing the individual points in a given bag b, the learning algorithm could 
determine which point is closest to the target which would effectively reduce the 
problem to a single-instance problem. Secondly, as discussed earlier, in reality 
one can select a drug (which could be a small variation of an earlier drug tested).
However, the set of bags that correspond to real drugs are limited and in general 
there will not exist a drug that would have as its likely conformations the desired 
r points. The problem of defining a multiple-instance membership query that 
captures the physical constraints of the underlying chemistry is challenging and 
we do not claim to have solved that problem here. However, we propose the 
following model which we feel is a good starting point for developing a theory 
of learning with queries for real-valued multiple-instance learning. Given a bag 
$b = \{p_1, \ldots, p_r\}$ where $b$ is provided by an adversary, we define the MI-MQ
oracle to be one that takes as input any $n$-dimensional shift vector $v$ and returns
the real-valued label for $b + v = \{p_1 + v, \ldots, p_r + v\}$. One can think of this
as a very rough approximation of what happens when the chemical structure of 
one of the previously tested drugs is slightly altered and thus creates a similar 
set of points in the bag but with some variations (which we model as a simple 
shift). Since the molecule smoothly moves between conformations (shapes) an
interesting problem is to find an alternate way to capture this dependence in the 
learning model. 
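The MI-MQ oracle defined above can be sketched as a small class. This is an illustrative sketch only: the class name `MIMQOracle`, the method name `query`, and the example points are assumptions, and the oracle returns the minimum distance directly, in line with the convention that $d_{min}$ itself is given.

```python
import math

class MIMQOracle:
    """Sketch of the MI-MQ oracle: the bag b is fixed (chosen by an adversary)
    and the learner only chooses the shift vector v."""

    def __init__(self, target, bag):
        self._target = tuple(target)          # hidden target point t
        self._bag = [tuple(p) for p in bag]   # adversarially selected bag b

    def query(self, v):
        """Return min_i dist(t, p_i + v) in the L2 norm."""
        shifted = (tuple(pj + vj for pj, vj in zip(p, v)) for p in self._bag)
        return min(math.dist(self._target, q) for q in shifted)

oracle = MIMQOracle(target=(1.0, 2.0), bag=[(0.0, 0.0), (4.0, 6.0)])
print(oracle.query((0.0, 0.0)))  # distance from t to the closest unshifted point
print(oracle.query((1.0, 2.0)))  # shifting the first point onto t gives label 0.0
```

Note that the learner never sees which point of the shifted bag attains the minimum, which is what distinguishes this oracle from a single-instance membership query.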

3 Prior Work 

We begin with a summary of the prior work on learning the (boolean)
multiple-instance concept class of axis-aligned boxes in $n$-dimensional space. Long and
Tan [3] described an efficient PAC algorithm under the restriction that each
point in the bag is drawn independently from a product distribution, $\mathcal{D}_{product}$.
Hence the resulting distribution over $r$-examples^ is $\mathcal{D}_{product}^r$. Auer et al. gave
an efficient PAC algorithm that allows each point to be drawn independently
from an arbitrary distribution. Hence each $r$-example is drawn from $\mathcal{D}^r$ for an
arbitrary $\mathcal{D}$. In their paper, Auer et al. also proved that if the distributional
requirements are further relaxed to be arbitrary distributions over $r$-examples
then learning axis-aligned boxes is as hard as learning DNF formulas in the
PAC model. Blum and Kalai [Q described a simple reduction from the problem
of PAC learning from multiple-instance examples to that of PAC learning with
one-sided random classification noise when the $r$-examples are drawn from $\mathcal{D}^r$
for $\mathcal{D}$ any distribution. They also described a more efficient (and more involved)
reduction to the statistical-query model [Hj that yields the most efficient PAC
algorithm known for learning axis-aligned boxes in the multiple-instance model.
Their algorithm has sample complexity $O(n^2 r/\epsilon^2)$, roughly a factor of $r$ better
than the result of Auer et al.

To understand some of the difficulties that occur when switching from the
boolean to the real-valued setting, we briefly overview the basic technique used to
obtain these results. They all use the key property that all regions in $\mathbb{R}^n$ not
intersecting with the target box have the same value for the fraction of examples
in the region that are positive (since they must be false positives, which occur
if one of the other $r - 1$ points is in the target box). Hence one can obtain a
PAC algorithm as follows. In each dimension, consider the axis-aligned halfspace
defined by each point. Since all points in the target region are positive, the
halfspace that cuts off roughly $\epsilon/(2n)$ weight from the target region can be
detected by comparing the fraction of the positive and negative examples of
a sequence of halfspaces moving towards one of the $2n$ box boundaries (from
infinity) at discrete locations as defined by the points in the sample. It is easily
seen that the distributional assumptions can be slightly relaxed. Namely, suppose
that each $r$-example is drawn from $\mathcal{D}_1 \times \cdots \times \mathcal{D}_r$. Let $\mathcal{D}_i^+$ be the weight of the
positive examples in $\mathcal{D}_i$. As long as $\mathcal{D}_1^+ = \mathcal{D}_2^+ = \cdots = \mathcal{D}_r^+$ this key property still
holds and hence the class of axis-aligned boxes is PAC learnable in this model.
If we relax the distribution assumptions more and no longer require all $\mathcal{D}_i^+$ to
be the same then this technique of individually finding each of the $2n$ halfspaces
breaks down.

^ We use r-example versus bag when it is required that all bags contain exactly r 
examples. 
We now consider the real-valued setting where the label for bag $b$ is a
function of the distance between the closest point in $b$ and the target. In this
setting, the sharp change that occurs in the fraction of positive examples as a 
halfspace crosses the boundary of the box (in the boolean domain) is no longer 
present. Hence, a completely different approach is needed. However, in order to 
ensure that we obtain an algorithm that is polynomial for an arbitrary number 
of dimensions, we must in some way be able to independently work with each 
dimension (or at least a constant number of dimensions at a time). 

The only theoretical work which we are aware of that studies real-valued 
multiple-instance learning is work by Goldman and Scott [7]. Similar to our
work here, they associate a real-valued label with each point in the
multiple-instance example. These values are then combined using a real-valued aggregation
operator to obtain the classification for the example. Here we only consider
the minimum for the aggregation operator. They provide on-line agnostic algo- 
rithms for learning real-valued multiple-instance geometric concepts defined by 
axis-aligned boxes in constant dimensional space by reducing the learning prob- 
lem to one in which the exponentiated gradient (or gradient descent) algorithm 
can be used. However, their work (and their basic technique) assumes that d 
is constant which is not feasible for the drug discovery application since d is 
typically in the hundreds. 

Most empirical work also considers the boolean setting. Recently, Amar, 
Dooly, Goldman, and Zhang |T] empirically studied diverse-density based and 
k-citation nearest neighbor based algorithms for learning in the real-valued 
multiple-instance model. However, even for the original versions of the diverse
density m and k-citation nearest neighbor algorithms m for the boolean
domain, no theoretical results have been shown. Also, Ray and Page m studied
multiple-instance linear regression using artificial data to empirically evaluate
their algorithm which uses an inductive logic programming based approach
combined with a linear regression algorithm supplemented with EM. Again, no 
theoretical results are given in their work. The goal of our work here is to begin 
developing theoretical foundations for the real-valued multiple-instance model 
for high-dimensional spaces. 



4 Results for the Real-Valued Multiple-Instance Model



For the remainder of this paper we study the real-valued multiple-instance
problem when we assume that each bag is drawn from an arbitrary distribution
$\mathcal{D}$ and can have any number of examples within it. We define the Real-Valued
Multiple-Instance $L_2$-Consistency Problem as the following problem. As input
you are given a set $S$ of $r$-examples each labeled with a real value. The problem
is to determine whether or not there is some target point $t \in \mathbb{R}^n$ such that
the label given to each bag is consistent with target $t$, where we assume bag
$b = \{p_1, \ldots, p_r\}$ for target $t$ would receive the label $\min_{i=1,\ldots,r} dist(t, p_i)$ with
the $L_2$ norm for the distance metric.






D.R. Dooly, S.A. Goldman, and S.S. Kwek 



4.1 Negative Results 

In this section we present some negative results demonstrating that the general
multiple-instance learning problem is hard.

Theorem 1. The Real-Valued Multiple-Instance $L_2$-Consistency problem is
NP-complete.

Proof. The proof is by reduction from 3-Sat. The instance space has $n$ dimensions,
one for each variable. The 3-Sat formula is transformed into a collection
of bags as follows. For each clause in the formula, we introduce a bag of 3 points
and assign it a label corresponding to a distance of $\sqrt{n-1}$. Each of these points
corresponds to a literal in the clause with all coordinates set to 0 except for
the coordinate that corresponds to the literal. If the corresponding literal is a
negated literal $\bar{x}_i$ then the coordinate is set to $-1$, otherwise it is set to 1. In
addition, we also add to this collection a bag containing only the origin with a
label corresponding to a distance of $\sqrt{n}$.

Suppose the point $t = (t_1, \ldots, t_n)$ labels these bags consistently. Let $s =
(s_1, \ldots, s_n)$ be an arbitrary point in a bag corresponding to a clause such that
$dist(t, s) = \sqrt{n-1}$. WLOG, suppose $s_i \neq 0$. Then,

$$dist(t, s) = \sqrt{(t_i - s_i)^2 + \sum_{j \neq i} t_j^2} = \sqrt{n-1}.$$

On the other hand,

$$dist(t, \mathbf{0}) = \sqrt{\sum_{j=1}^{n} t_j^2} = \sqrt{n}.$$

From these two equations, we can deduce that $t_i = s_i$: subtracting the squared
equations gives $2 t_i s_i - s_i^2 = 1$, and $s_i^2 = 1$.

Therefore, if there is a point which labels the bags consistently, we transform
it into an assignment of variables which satisfies all the clauses as follows: If
the coordinate of the point in dimension $i$ is $-1$, assign false to variable $x_i$.
If the coordinate of the point in dimension $i$ is 1, assign true to variable $x_i$.
Otherwise assign either true or false, at random. For each clause, at least one of
the three relevant coordinates of the point will cause an assignment to a variable
which makes that clause true. So the assignment satisfies all the clauses.

If there is an assignment of variables which satisfies all the clauses, then
the point with coordinate 1 in dimensions corresponding to true variables and
coordinate $-1$ in dimensions corresponding to false variables will meet all the
distance criteria, since it is at distance $\sqrt{n}$ from the origin, and there will be at
least one of the three points in each bag for which it is at distance $\sqrt{n-1}$. □

Theorem 1 does not indicate that learning is hard, but only that any learning
algorithm that requires the consistency problem to be solved is not feasible.
We now give a hardness result showing that the real-valued multiple-instance
learning problem is as hard as learning DNF even if the learner is allowed to use
a hypothesis class that is not simply a point in $\mathbb{R}^n$. The statement of this result
is very similar to the hardness result for the boolean multiple-instance model of
Auer et al. [2]. We note that our result does not follow from their results since
each bag in the boolean model is labeled as positive iff at least one of its points
is in the target box. Here, each bag must be labeled with the distance between its
closest point and the target point. Hence, neither result subsumes the other.
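The bag construction used in the proof of Theorem 1 can be made concrete with a short sketch. The function names `sat_to_bags`, `is_consistent` and `assignment_to_point`, the signed-integer clause encoding, and the example formula are all assumptions introduced for illustration.

```python
import math

def sat_to_bags(n, clauses):
    """clauses: lists of nonzero ints, +i for x_i and -i for its negation.
    Returns a list of (points, label) pairs as in the reduction."""
    bags = []
    for clause in clauses:
        points = []
        for lit in clause:
            p = [0.0] * n
            p[abs(lit) - 1] = 1.0 if lit > 0 else -1.0
            points.append(tuple(p))
        bags.append((points, math.sqrt(n - 1)))
    # One extra bag holding only the origin, labeled sqrt(n).
    bags.append(([(0.0,) * n], math.sqrt(n)))
    return bags

def is_consistent(t, bags, tol=1e-9):
    """Check that the minimum distance from t to each bag matches its label."""
    return all(abs(min(math.dist(t, p) for p in pts) - d) <= tol for pts, d in bags)

def assignment_to_point(assignment):
    """Map true to coordinate 1 and false to coordinate -1."""
    return tuple(1.0 if value else -1.0 for value in assignment)

clauses = [[1, -2, 3], [-1, 2, 3]]
bags = sat_to_bags(3, clauses)
print(is_consistent(assignment_to_point([True, True, False]), bags))   # satisfying assignment
print(is_consistent(assignment_to_point([False, True, False]), bags))  # falsifies the first clause
```

A small tolerance is used in the comparison because the labels are square roots computed in floating point.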



Theorem 2. Learning in the real-valued multiple-instance setting is as hard as 
learning DNF. 



Proof. This proof is by reduction from the problem of learning $r$-term DNF to
the problem of learning in the real-valued multiple-instance setting. Without
loss of generality, we assume the target concept does not classify every instance
as false. For ease of exposition, we assume that $n > 2$.

First, consider the extremely simple case where there is only a single variable
$x_1$. In this case, the target DNF formula $\phi$ can only be $x_1$, $\bar{x}_1$ or $\emptyset$ (i.e., true).
These candidate formulas are represented using the vertices $P$ (for positive), $N$
(for negative) and $I$ (for irrelevant) respectively of an equilateral triangle in the
two-dimensional Euclidean plane. The origin $O$, which is also the center of this triangle,
is distance 1 away from the vertices (see Figure HI). Let e = ^ - Let T (for true) 
be a point that lies outside the triangle and is ^/l — e away from both points 
P and I. Similarly, let F (for false) be a point that lies outside the triangle 
and is Vl — e away from both points N and I. An instance x is then mapped 
to r if xi = 1 and F otherwise. Let g{x) and g{(j)) be the points obtained by 
transforming x and (p as described above. Since n > 2 (and hence e < 1/4), it is 
straightforward to verify that dist{g{x), g{(j))) = y/1 — e if x satisfies p since T 
and F are outside of the triangle defined by /, N, and P. (See Figure [U) 
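The single-variable gadget can be checked numerically. The sketch below is only an illustration: the placement of the triangle (I at the top, P and N at the bottom corners) and the helper names `outside_point` and `gadget` are assumptions of this example, not part of the construction's statement.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of the plane."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def outside_point(u, w, radius):
    """Point at distance `radius` from both u and w, lying on the side of
    the segment uw that faces away from the origin (outside the triangle)."""
    mid = ((u[0] + w[0]) / 2, (u[1] + w[1]) / 2)
    half = dist(u, w) / 2
    height = math.sqrt(radius ** 2 - half ** 2)   # offset beyond the midpoint
    norm = math.hypot(mid[0], mid[1])             # mid/norm points away from O
    return (mid[0] + height * mid[0] / norm, mid[1] + height * mid[1] / norm)

def gadget(n):
    """Points O, I, P, N, T, F for epsilon = 1/(2n); requires n > 2."""
    eps = 1.0 / (2 * n)
    radius = math.sqrt(1.0 - eps)
    s = math.sqrt(3) / 2
    # Assumed placement: an equilateral triangle with circumradius 1 around O.
    pts = {"O": (0.0, 0.0), "I": (0.0, 1.0), "P": (s, -0.5), "N": (-s, -0.5)}
    pts["T"] = outside_point(pts["P"], pts["I"], radius)
    pts["F"] = outside_point(pts["N"], pts["I"], radius)
    return pts, eps

pts, eps = gadget(3)
print(dist(pts["T"], pts["P"]), dist(pts["T"], pts["N"]))
```

With this placement, T is at distance $\sqrt{1-\epsilon}$ from P and I, and farther than 3/2 from N; F behaves symmetrically.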




Fig. 1. The geometry of points I, P, N, F, T, O for the proof of Theorem 2.

Next, we extend this reduction to conjunctions. Here, each literal is represented by a point in some 2-dimensional Euclidean plane. That is, the transformed instance space is of dimension $2n$. The coordinates of the target $\phi$ corresponding to the variable $x_i$ are set to the point $g(\phi_i) = $ P, N, or I depending on whether the literal $x_i$ is present as a positive literal, present as a negative literal, or is absent in $\phi$, respectively. Similarly, the coordinates corresponding to the value of the variable $x_i$ in an instance $x$ are set to the point $g(x_i) = $ T or F depending on whether $x_i = 1$ or $x_i = 0$. As before, if $x_i$ does not falsify the target $\phi$, then $\mathrm{dist}(g(x_i), g(\phi_i)) = \sqrt{1 - \epsilon}$; otherwise $\mathrm{dist}(g(x_i), g(\phi_i)) > 3/2$. In other words, if $x$ satisfies $\phi$ then $\mathrm{dist}(g(x), g(\phi)) = \sqrt{n(1 - \epsilon)}$, otherwise

$$\mathrm{dist}(g(x), g(\phi)) > \sqrt{\tfrac{9}{4} + (n - 1)(1 - \epsilon)} > \sqrt{n}.$$

To change the latter inequality to an equality so that all the negative instances in the DNF learning problem are mapped into points with a unique value, we treat $g(x)$ as a bag and add another point $o = (O, \ldots, O)$. Clearly, $\mathrm{dist}(g(\phi), o) = \sqrt{n}$ for all possible choices of $\phi$ and hence $\mathrm{dist}(g(x), g(\phi)) = \sqrt{n}$ if $x$ does not satisfy $\phi$.

Finally, we extend the reduction to r-term DNF. The transformed instance space is of dimension $2n \times r$. An instance $x$ is mapped into a bag $B(x)$ of $r + 1$ points. This bag contains the origin $o = (O, \ldots, O)$. For each term $T_i$, we introduce a point $p_i$ into the bag. The values of the $(2n(i - 1) + 1)$th to $(2ni)$th coordinates of $p_i$ are set as described in the previous paragraph. The other coordinates are set to O. Viewed another way, in our translation of the problem, we have a 2-dimensional subspace for each combination of boolean variable and term of the DNF formula, giving $rn$ such subspaces in all. We use $S_{ij}$ to denote the subspace associated with the $i$th variable of term $j$. We transform the given r-term DNF formula $\phi$ into a target point in $2rn$-dimensional space as follows. If the $j$th term in $\phi$ contains a positive variable $x_i$, we set the coordinates for subspace $S_{ij}$ to P. If the $j$th term contains $\bar{x}_i$ we set $S_{ij}$ to N. Finally, if variable $x_i$ does not appear in the $j$th term of $\phi$, we set $S_{ij}$ to I. As an example, if $\phi = (x_1 \wedge \bar{x}_2 \wedge x_4) \vee (x_1 \wedge x_3)$ (so $n = 4$ and $r = 2$) then the target point $g(\phi) = $ PNIP PIPI, where the first four pairs correspond to the first term and the second four pairs correspond to the second term.

We translate an assignment of boolean variables as follows. The assignment $x_1, x_2, \ldots, x_n$ is transformed into a bag containing $r + 1$ points. For $1 \le j \le r$, in the $j$th of these points, for each variable $x_i = 0$, we set the coordinates in subspace $S_{ij}$ to F. For each variable $x_i = 1$, we set the coordinates in subspace $S_{ij}$ to T. For all $k \ne j$, we set the coordinates in subspace $S_{ik}$ to O. For the last point, we set the coordinates in all subspaces to O. So continuing our example, assignment $x = 1001$ would be translated to the bag $g(x) = \{$TFFT OOOO, OOOO TFFT, OOOO OOOO$\}$. Each of the first $r$ points tests to see if this assignment satisfies term $j$, while the last is a reference point known to be closer to the target than the point corresponding to any unsatisfied term. To complete the transformation, we give the positive bags the value $\sqrt{rn - n\epsilon} = \sqrt{rn - 1/2}$ and the negative bags the value $\sqrt{rn}$. We also include $rn$ bags containing three points each. In each group of three, the coordinates in one of the subspaces $S_{ij}$ are assigned to P, N, or I, and the coordinates in all other subspaces are assigned to O. Each of these $rn$ bags has value $\sqrt{rn - 1}$. Finally, we have a bag containing one point $o$. In all of the subspaces $S_{ij}$ the coordinates are assigned to O. This bag has value $\sqrt{rn}$.

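As for the case of conjunctions it can be easily verified that if $x$ satisfies $\phi$, then $\mathrm{dist}(g(x), g(\phi)) = \sqrt{n(1 - \epsilon) + (r - 1)n} = \sqrt{rn - \epsilon n} = \sqrt{rn - 1/2}$. Conversely, if $x$ falsifies $T_i$ then the point $p_i$ of the bag that corresponds to $T_i$ satisfies

$$\mathrm{dist}(p_i, g(\phi)) > \sqrt{(n - 1)(1 - \epsilon) + \tfrac{9}{4} + (r - 1)n} > \sqrt{rn}$$

since $\epsilon = 1/(2n)$. Further, since $\mathrm{dist}(o, g(\phi)) = \sqrt{rn}$,

$$\mathrm{dist}(g(x), g(\phi)) = \begin{cases} \sqrt{rn - n\epsilon}, & x \text{ satisfies } \phi \\ \sqrt{rn}, & x \text{ does not satisfy } \phi \end{cases}$$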



as desired. 
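The bag labels claimed above can be verified by brute force on the running example. The following is a minimal sketch under stated assumptions: the triangle is placed with I at the top, terms are encoded as dictionaries mapping a 0-based variable index to `True` (positive literal) or `False` (negated literal), and all function names are hypothetical.

```python
import math
from itertools import product

def gadget(n):
    """2-D points of the reduction with eps = 1/(2n); requires n > 2."""
    eps = 1.0 / (2 * n)
    h = math.sqrt(0.25 - eps)      # how far T and F sit outside the triangle
    s = math.sqrt(3) / 2
    pts = {
        "O": (0.0, 0.0), "I": (0.0, 1.0), "P": (s, -0.5), "N": (-s, -0.5),
        "T": ((s / 2) * (1 + 2 * h), 0.25 * (1 + 2 * h)),    # equidistant from P, I
        "F": ((-s / 2) * (1 + 2 * h), 0.25 * (1 + 2 * h)),   # equidistant from N, I
    }
    return pts, eps

def embed(word, pts):
    """Concatenate the 2-D points named by the letters of `word`."""
    return [c for letter in word for c in pts[letter]]

def target_word(terms, n):
    """One letter (P, N or I) per subspace S_ij, term by term."""
    return "".join(
        "I" if i not in term else ("P" if term[i] else "N")
        for term in terms for i in range(n)
    )

def bag(x, r, pts):
    """The r + 1 points that an assignment x (tuple of 0/1) is mapped to."""
    n = len(x)
    block = "".join("T" if v else "F" for v in x)
    points = [embed("O" * (n * j) + block + "O" * (n * (r - 1 - j)), pts)
              for j in range(r)]
    points.append(embed("O" * (n * r), pts))     # the reference point o
    return points

def bag_distance(points, t):
    """Label of a bag: distance from the target to its closest point."""
    return min(math.dist(p, t) for p in points)

def satisfies(x, terms):
    return any(all(bool(x[i]) == sign for i, sign in term.items())
               for term in terms)

# Running example: (x1 and not x2 and x4) or (x1 and x3), n = 4, r = 2.
n, r = 4, 2
terms = [{0: True, 1: False, 3: True}, {0: True, 2: True}]
pts, eps = gadget(n)
t = embed(target_word(terms, n), pts)
for x in product([0, 1], repeat=n):
    label = bag_distance(bag(x, r, pts), t)
    print(x, satisfies(x, terms), round(label, 6))
```

Satisfying assignments receive the label $\sqrt{rn - 1/2}$ and all others receive $\sqrt{rn}$, matching the case distinction above.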

Suppose that we have an algorithm which is able to find a $2rn$-dimensional point $p$ which has a distance to each provided bag where the distance equals the specified label. For any $2rn$-dimensional point $p$, we use $p_{ij}$ to denote the value of $p$ for subspace $S_{ij}$. Let $q_{ij}$ be the one of P, N, or I which is closest to $p_{ij}$. We now argue that the distance between $p_{ij}$ and $q_{ij}$ is zero. That is, $p_{ij}$ must be one of P, N, or I. From the bag with O for all subspaces, we have $\sum_{ij} \mathrm{dist}(p_{ij}, O)^2 = rn$. Multiplying both sides by $rn - 1$ yields

$$(rn - 1) \sum_{ij} \mathrm{dist}(p_{ij}, O)^2 = rn(rn - 1). \qquad (1)$$

From the $rn$ bags with three points each we have that $\mathrm{dist}(p_{k\ell}, q_{k\ell})^2 + \sum_{(i,j) \ne (k,\ell)} \mathrm{dist}(p_{ij}, O)^2 = rn - 1$. Summing over the $rn$ subspaces we get

$$\sum_{k\ell} \mathrm{dist}(p_{k\ell}, q_{k\ell})^2 + \sum_{k\ell} \sum_{(i,j) \ne (k,\ell)} \mathrm{dist}(p_{ij}, O)^2 = rn(rn - 1).$$

Using the observation that $\sum_{k\ell} \sum_{(i,j) \ne (k,\ell)} \mathrm{dist}(p_{ij}, O)^2 = (rn - 1) \sum_{ij} \mathrm{dist}(p_{ij}, O)^2$ gives

$$\sum_{k\ell} \mathrm{dist}(p_{k\ell}, q_{k\ell})^2 + (rn - 1) \sum_{ij} \mathrm{dist}(p_{ij}, O)^2 = rn(rn - 1). \qquad (2)$$

Combining Equations (1) and (2) gives that $\sum_{k\ell} \mathrm{dist}(p_{k\ell}, q_{k\ell})^2 = 0$; hence $p_{k\ell}$ must be one of P, N, or I.

Let us now consider a positive bag (i.e. a bag with label $\sqrt{rn - n\epsilon}$). One of the points in this bag must be at distance $\sqrt{rn - n\epsilon}$ from the target point $t = g(\phi)$. Let it correspond to term $j$ and let us call it $z$. Since $z_{ik} = O$ and $\mathrm{dist}(t_{ik}, O) = 1$ for every $k \ne j$, we can subtract the distance in all the subspaces except those corresponding to term $j$ to get $\sum_i \mathrm{dist}(t_{ij}, z_{ij})^2 = n(1 - \epsilon)$. So each variable $i$ must satisfy term $j$. Let us consider a negative bag. All of the points in this bag must be at least distance $\sqrt{rn}$ from $t$. Let us pick a point $w$ corresponding to term $j$. There must be at least one subspace for which $\mathrm{dist}(t_{ij}, w_{ij})^2 > 1$. The only way this can happen is for variable $i$ to fail to satisfy term $j$. So we can read the terms of the DNF from the values that $t$ takes in the subspaces. If $t_{ij}$ has location P, then term $j$ contains literal $x_i$. If $t_{ij}$ has location N, then term $j$ contains the literal $\bar{x}_i$. Finally, if $t_{ij}$ has location I, then term $j$ does not contain $x_i$ (or its negation). □

4.2 Our Positive Result

In this section we present a positive result. Let $b$ be an arbitrary bag provided by an adversary. We assume that we have access to an MI-MQ oracle and that from the label provided by this oracle we can compute the distance between the closest point in $b + v$ and the target $t$, where $v$ is the input given to the MI-MQ oracle. It is important to remember that although we can compute the distance between the target and the closest point from $b + v$, this provides no information as to which point in $b + v$ is closest to the target.

The high-level approach used by our algorithm is to (working one dimension 
at a time) determine the coordinate in each dimension of the target point. We 
do this by finding a set of vectors for which we can determine the coordinate 
of the point which is closest to the target point, and using these to determine 
the coordinate of the target point. If we could guarantee that we have a unique 
closest point in a bag, and we pick small enough distances in each dimension, 
we can use this technique to determine the gradient of the distance function. 
From that, we can determine an offset from the bag which reaches the target. 
From that we can determine r such offsets, one from each point in the bag. 
This uniquely identifies the target. If this process fails (because there was not a 
unique closest point), one could randomly move the bag. With high probability, 
one would then have a unique closest point, since failure occurs only when the 
target lies on a hyperplane equidistant from two closest points. This probabilistic 
algorithm is more intuitive, but is not guaranteed to halt. Hence, we instead 
present a somewhat more complex polynomial-time deterministic algorithm. 

In order to describe our algorithm in more depth, the following definitions are needed. For any target $t$, any vector $v$, and any point $p$, we define the function $d_{t,p}(v) = \|t - (p + v)\|^2$. For a bag $b = \{p_1, \ldots, p_r\}$ of points, we define $d_{t,b}(v) = \min_{i = 1, \ldots, r} d_{t,p_i}(v)$.

For each dimension $1 \le j \le n$, we consider the function $s^j_{t,p}(x) = d_{t,p}(x \cdot v_j)$ where $v_j$ is the unit vector in dimension $j$ and $x$ is a scalar. That is, $s^j_{t,p}(x) = \|t - (p_1, \ldots, p_{j-1}, p_j + x, p_{j+1}, \ldots, p_n)\|^2$. Observe that

$$s^j_{t,p}(x) = \sum_{k \ne j} (t_k - p_k)^2 + (t_j - p_j - x)^2 = \ell + (x - x')^2$$

for constants (with respect to $x$) of $\ell = \sum_{k \ne j} (t_k - p_k)^2$ and $x' = (t_j - p_j)$, which is a parabola. Finally, let $s^j_{t,b}(x) = \min_{p \in b} s^j_{t,p}(x)$.
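These definitions translate directly into code. The sketch below is illustrative only: it uses 0-based coordinate indices, and the function names are chosen for this example rather than taken from the paper.

```python
def d_point(t, p, v):
    """d_{t,p}(v) = ||t - (p + v)||^2."""
    return sum((ti - (pi + vi)) ** 2 for ti, pi, vi in zip(t, p, v))

def d_bag(t, b, v):
    """d_{t,b}(v): minimum of d_{t,p}(v) over the points p of bag b."""
    return min(d_point(t, p, v) for p in b)

def s_point(t, p, j, x):
    """s^j_{t,p}(x): shift p by the scalar x along coordinate j."""
    v = [0.0] * len(t)
    v[j] = x
    return d_point(t, p, v)

def parabola(t, p, j):
    """Constants (l, x_prime) such that s^j_{t,p}(x) = l + (x - x_prime)^2."""
    l = sum((tk - pk) ** 2 for k, (tk, pk) in enumerate(zip(t, p)) if k != j)
    return l, t[j] - p[j]

def s_bag(t, b, j, x):
    """s^j_{t,b}(x): lower envelope of the r parabolas of the bag."""
    return min(s_point(t, p, j, x) for p in b)

t = (1.0, 2.0, -1.0)
b = [(0.5, 0.0, 3.0), (2.0, 1.0, 0.0)]
l, x_prime = parabola(t, b[0], 1)
print(l, x_prime, s_bag(t, b, 1, 0.0))
```

Each point contributes one parabola with leading coefficient 1, and `s_bag` takes the pointwise minimum of these parabolas.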

Figure 2 shows a visualization of $s^j_{t,b}(x)$. First, notice that it is obtained by combining the $r$ parabolas given by $s^j_{t,p}(x)$. For each value of $x$, the value of $s^j_{t,b}(x)$ is that of the parabola that has the lowest value at $x$. We let $f_i^{(j)}$ denote the $i$th parabola (for dimension $j$) that defines $s^j_{t,b}(x)$ going from left to right.

Let $f_r^{(j)}$ be the rightmost parabola. Suppose for a moment that we could determine which point $p^{(j)}$ corresponds to $f_r^{(j)}$. If this were the case, then the value at which $f_r^{(j)}$ attains its minimum would give us the translation for which $p^{(j)}$ is closest to the target. From this we can determine the coordinate of the target in that dimension.

Fig. 2. A visualization of $s^j_{t,b}(x)$. Notice that the y-intercept $\ell_0$ of $f_1$ corresponds to the label of bag $b$.

Our goal is now to independently, for each dimension $j$, find the minimum of the parabola corresponding to $p^{(j)}$. For ease of exposition, suppose this parabola was $f_r^{(j)}$ and that its minimum value occurred at a value of $x_r$. We now demonstrate how we can use the MI-MQ oracle to find $f_r^{(j)}$. Once we have done this, we have found a shift $x_j$ for dimension $j$ for which $p^{(j)}_j + x_j = t_j$. By then repeating this in each dimension, we have enough information to compute the point $t = (p^{(1)}_1 + x_1, p^{(2)}_2 + x_2, \ldots, p^{(n)}_n + x_n)$.

As discussed above, in $s^j_{t,b}(x)$ there are $r$ parabolas, one for each point in $b$. Each parabola reaches its minimum at the value of $x$ that represents the dimension-$j$ shift after which $t_j - p_j = 0$. It is important to note (as shown above) that all $r$ parabolas are of the form $\ell + (x - x')^2$ where $\ell$ and $x'$ may be different for each of the points. In particular, for point $p$, $\ell$ is the label for bag $b$ that would be obtained if bag $b$ were shifted in dimension $j$ so that $t_j - p_j = 0$, and $x'$ is the value of $x$ where this parabola reaches its minimum value. Intuitively, as we translate far enough in a given dimension, the closest point to the target will eventually be one of those with the smallest coordinate in that dimension. We now formalize this intuition in the following lemma.

Lemma 1. Let bag $b$ have $r$ points and let $v$ be any vector. Let $\hat{v}$ be the unit vector corresponding to $v$. Let $P_m$ be the set of points in $b$ for which $p \cdot \hat{v}$ is minimized for $p \in b$. Let $d_1$ be the minimal non-zero value of the projection along $v$ of the vectors from points in $P_m$ to points in $b$. That is, $d_1 = \min_{p \in b \setminus P_m} (p - p_m) \cdot \hat{v}$ for $p_m$ any point in $P_m$. Similarly, let $d_2$ be the maximal projection of the distance along the hyperplane, normal to $v$, between any $p_m$ and any other point $p \in b$. Let $d_0$ be the distance between $b$ and the target. Then for any translation $x$ greater than [...] $+\, d_0 + d_1$, the closest point in $b + x\hat{v}$ to the target will be a point in $P_m$.

Proof Sketch: The distance between any point in $b + x\hat{v}$ and the target can be expressed in terms of the component along $v$ and the component normal to $v$. Let us denote the component along $v$ from any point $p_m \in P_m$ to the target as $d_v = (p_m - t) \cdot \hat{v}$. For such a translation $x$, $d_v > (d_0 + d_2)^2 / (2 d_1)$. The distance from any other point $p$ to the target is at least that of its component along $v$, which is at least $d_v + d_1$, while the distance from some point $p_m \in P_m$ to the target is at most $\sqrt{d_v^2 + (d_0 + d_2)^2} < \sqrt{d_v^2 + 2 d_1 d_v} < \sqrt{d_v^2 + 2 d_1 d_v + d_1^2} = d_v + d_1$. □



Procedure Find_Coordinate(j, b)
    Let d_0 be the label of bag b
    Find all points p_m with minimal j coordinate x_m
    Find a point p_1 with second-minimal j coordinate x_1
    Let d_1 = x_1 - x_m
    Find the point p_j with maximal value of d_2^2 = distance^2(p_m, p_j) - d_1^2
    For z = 1, 2, 3
        Let v = 0 except set v_j = [...] + d_0 + d_1 + z
        l_z = MI-MQ(b, v)
        Let y_z be the distance corresponding to label l_z
    Let f_r^(j) be the parabola of form l + (x - x')^2 that contains the
        three queried points (one per value of z)
    Let x_r be the point at which f_r^(j) is minimized.
    Return x_r + x_m



Fig. 3. The procedure Find_Coordinate searches for the coordinate of the target in 
dimension j. 



We now describe how we use this lemma in the procedure Find_Coordinate (see the detailed pseudo-code in Figure 3). We find the set of leftmost starting points and a translation large enough to make one of them the closest point to the target. We then use the MI-MQ oracle to query the value of $s^j_{t,b}(x)$ for 3 points farther out. These points will lie on a parabola. The point at which this parabola attains its minimum gives us the j-coordinate of the target.

We now consider the procedure Find_Target, which is the overall procedure to learn the target point (see Figure 4). It independently finds the coordinates of the target in each of the $n$ dimensions.

Theorem 3. Assuming that each call to the MI-MQ oracle takes constant time, Find_Target has a worst-case time complexity of $O(nr^2)$ and is guaranteed to output the target point $t$.

Algorithm Find_Target(b)
    Find distances between all points of b
    For each of the n dimensions k = 1, ..., n
        Let v_k = Find_Coordinate(k, b)
    Return v = (v_1, ..., v_n)

Fig. 4. The algorithm Find_Target. Note that all bags created are linear transformations of the original bag b provided by the adversary.

Proof Sketch: Find_Coordinate takes $O(r^2)$ time to find the $x_j$. So each repetition of the loop in Find_Target takes $O(r^2)$ time. Since there are $n$ repetitions of the loop, the total time taken is $O(nr^2)$. □
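The two procedures can be exercised against a simulated oracle. The sketch below is not the paper's exact procedure: the oracle `make_oracle` is a stand-in that knows the target, and the translation threshold `shift` is a bound derived for this example (it uses the bag diameter and the label $d_0$) rather than the threshold of Lemma 1. The parabola is recovered from squared distances, using the fact that its leading coefficient is 1.

```python
import math

def make_oracle(target):
    """Simulated MI-MQ oracle (assumption of this sketch): returns the
    distance from the hidden target to the closest point of b + v."""
    def mi_mq(bag, v):
        return min(math.dist(target, [pi + vi for pi, vi in zip(p, v)])
                   for p in bag)
    return mi_mq

def find_coordinate(j, bag, mi_mq):
    """Recover coordinate j (0-based) of the target from oracle labels."""
    n = len(bag[0])
    zero = [0.0] * n
    d0 = mi_mq(bag, zero)                        # label of the original bag
    x_m = min(p[j] for p in bag)                 # minimal j-coordinate
    larger = [p[j] for p in bag if p[j] > x_m]
    if larger:
        d1 = min(larger) - x_m                   # gap to second-minimal coordinate
        diam = max(math.dist(p, q) for p in bag for q in bag)
        # Beyond this shift a leftmost point is closest to the target
        # (this sketch's own sufficient bound, not the lemma's threshold).
        shift = (2 * d0 * diam + diam ** 2) / (2 * d1) + 1.0
    else:
        shift = 0.0                              # every point is leftmost
    xs, ys = [], []
    for z in (1, 2, 3):
        v = list(zero)
        v[j] = shift + z
        xs.append(v[j])
        ys.append(mi_mq(bag, v) ** 2)            # squared distance lies on the parabola
    # For y = l + (x - x_r)^2, two points determine x_r; average two estimates.
    estimates = [
        (xs[a] + xs[b]) / 2 - (ys[b] - ys[a]) / (2 * (xs[b] - xs[a]))
        for a, b in ((0, 1), (1, 2))
    ]
    x_r = sum(estimates) / len(estimates)
    return x_r + x_m

def find_target(bag, mi_mq):
    """Recover the target one coordinate at a time."""
    return [find_coordinate(j, bag, mi_mq) for j in range(len(bag[0]))]

bag = [(0.0, 0.0, 0.0), (1.0, 2.0, -1.0), (3.0, -1.0, 2.0), (-2.0, 1.0, 1.0)]
hidden_target = [0.7, -1.3, 2.4]
print(find_target(bag, make_oracle(hidden_target)))
```

Computing the diameter costs $O(r^2)$ per coordinate, so this sketch stays within the $O(nr^2)$ bound of Theorem 3 when oracle calls are counted as constant time.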

5 Concluding Remarks 

In this paper, we present some hardness results and a positive result for learning in a real-valued multiple-instance learning model. We hope that this work will be the beginning of a theoretical study of learning in the real-valued multiple-instance model and eventually lead to improved algorithms for applications such as drug discovery. There are many interesting open problems. For example, are there non-trivial distributional assumptions for which there is an efficient PAC learning (or on-line learning) algorithm to approximate the target point from real-valued multiple-instance data? Similarly, can hardness results be shown for more restricted distributions? Finally, are there alternate definitions for a multiple-instance membership query that better capture the physical constraints of the drug-discovery application?

Abstract. The paper introduces a way of re-constructing a loss function from predictive complexity. We show that a loss function and expectations of the corresponding predictive complexity w.r.t. the Bernoulli distribution are related through the Legendre transformation. It is shown that if two loss functions specify the same complexity then they are equivalent in a strong sense.



1 Introduction 

Predictive complexity was introduced in [H] as a natural development of pre- 
diction with expert advice. Predictive complexity bounds the cumulative error 
suffered by any on-line learning algorithm on a sequence. It may be considered 
an inherent measure of “learnability” of a string in the same way as Kolmogorov 
complexity reflects the “simplicity” of a string. 

Different measures of error (loss functions) specify different variants of predictive complexity; some of them do not have any corresponding predictive complexity at all. This paper shows how a loss function may be recovered from the predictive complexity $\mathcal{K}$ it generates. The loss function $\lambda$ and the expectations $\mathbf{E}\mathcal{K}(\xi)$, where $\xi$ is a random variable distributed according to the Bernoulli law, are related via the Legendre transformation. Appendix A contains a brief introduction to the theory of the Legendre transformation in the one-dimensional case.

We show that if two loss functions specify the same complexity then they 
are equivalent in a very strong sense (virtually equal up to a parametrisation). 
This observation allows us to show that the variants of Kolmogorov complexity, 
namely, plain, prefix, and monotone, do not correspond to any game and thus 
are not predictive complexities. Another variant of Kolmogorov complexity, the 
minus logarithm of Levin’s a priori semimeasure, is known to be the predictive 
complexity specified by the logarithmic game. 

* Some results from this paper form a part of the technical report “The Existence of Predictive Complexity and the Legendre Transformation”, CLRC-TR-00-04, Computer Learning Research Centre, Royal Holloway College, University of London; these results were presented at Fourth French Days on Algorithmic Information Theory, TAI2000.

2 Preliminaries

2.1 Games and Superpredictions 

A game $\mathfrak{G}$ is a triple $(\Omega, \Gamma, \lambda)$, where $\Omega$ is called an outcome space, $\Gamma$ stands for a prediction space, and $\lambda : \Omega \times \Gamma \to \mathbb{R} \cup \{+\infty\}$ is a loss function. We suppose that a definition of computability over $\Omega$ and $\Gamma$ is given and $\lambda$ is computable according to this definition.

In this paper we are interested in the binary case $\Omega = \mathbb{B} = \{0, 1\}$. We will denote elements of $\mathbb{B}^*$ (i.e. finite strings of elements of $\mathbb{B}$) by bold letters, e.g. x, y. By log we denote logarithm to the base 2.

We impose the following restrictions in order to exclude degenerate games:

1. The set of possible predictions $\Gamma$ is a compact topological space.

2. For every $\omega \in \Omega$, the function $\lambda(\omega, \gamma)$ is continuous (w.r.t. the extended topology of $[-\infty, +\infty]$) in the second argument.

3. There exists $\gamma \in \Gamma$ such that, for every $\omega \in \Omega$, the inequality $\lambda(\omega, \gamma) < +\infty$ holds.

4. If there are $\gamma_0 \in \Gamma$, $\omega_0 \in \Omega$ such that $\lambda(\omega_0, \gamma_0) = +\infty$, then there is a sequence of $\gamma_n \in \Gamma$, $n = 1, 2, \ldots$, such that $\gamma_n \to \gamma_0$ as $n \to \infty$ and $\lambda(\omega_0, \gamma_n) < +\infty$.

Conditions 1-3 have been taken from [7]. Condition 4 essentially means that $\lambda$ accepts the infinite value only in exceptional cases which can be approximated by finite cases.

We say that a pair $(s_0, s_1) \in [-\infty, +\infty]^2$ is a superprediction if there exists a prediction $\gamma \in \Gamma$ such that $s_0 \ge \lambda(0, \gamma)$ and $s_1 \ge \lambda(1, \gamma)$. If we let $P = \{(p_0, p_1) \in [-\infty, +\infty]^2 \mid \exists \gamma \in \Gamma : p_0 = \lambda(0, \gamma) \text{ and } p_1 = \lambda(1, \gamma)\}$ (cf. the canonical form of a game in m), the set $S$ of all superpredictions is the set of points that lie “north-east” of $P$.
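As a concrete illustration, the superprediction test can be written out for the logarithmic game, whose loss is $-\log(1-\gamma)$ on outcome 0 and $-\log\gamma$ on outcome 1 with $\gamma \in [0, 1]$. The sketch below is an example under that choice of game; the function names and the grid-based search are assumptions of this illustration. For this particular game a prediction with the required losses exists exactly when $2^{-s_0} + 2^{-s_1} \le 1$, which gives a closed-form test to compare against.

```python
import math

def log_loss(outcome, gamma):
    """Logarithmic loss, base 2; +inf if the prediction excludes the outcome."""
    q = gamma if outcome == 1 else 1.0 - gamma
    return math.inf if q <= 0.0 else -math.log2(q)

def is_superprediction_grid(s0, s1, grid=10001):
    """Brute-force search for a prediction gamma on a grid of [0, 1] with
    s0 >= loss(0, gamma) and s1 >= loss(1, gamma)."""
    for k in range(grid):
        gamma = k / (grid - 1)
        if s0 >= log_loss(0, gamma) and s1 >= log_loss(1, gamma):
            return True
    return False

def is_superprediction(s0, s1):
    """Closed-form test for the logarithmic game."""
    return 2.0 ** (-s0) + 2.0 ** (-s1) <= 1.0

for pair in [(1.0, 1.0), (0.5, 0.5), (2.0, 0.5), (math.inf, 0.0)]:
    print(pair, is_superprediction(*pair), is_superprediction_grid(*pair))
```

The grid search is only a cross-check and can miss points that lie exactly on the boundary of $S$; the closed form is the reliable test for this game.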

The set S is of fundamental importance and many interesting properties of a 
game may be described in terms of S. This observation leads to the definition: 

Definition 1. Two games are equivalent if they have the same set of superpre- 
dictions. 

The following simple lemma describes the class of sets $S$ which may occur as sets of superpredictions.

Lemma 1. A set $S \subseteq [-\infty, +\infty]^2$ is the set of superpredictions for some game $\mathfrak{G}$ satisfying conditions 1-4 if and only if the following conditions hold:

- for every $(x, y) \in S$ and every $a, b \in [0, +\infty]$, we have $(x + a, y + b) \in S$,

- there are $a, b > -\infty$ such that $S \subseteq [a, +\infty] \times [b, +\infty]$,

- $S \cap \mathbb{R}^2 \ne \emptyset$, and

- the set $S$ is the closure of its finite part $S \cap \mathbb{R}^2$ w.r.t. the extended topology of $[-\infty, +\infty]^2$.

The last item is the direct counterpart of Condition 4.

Let us describe the intuition behind the concept of a game. Consider a prediction algorithm $\mathfrak{A}$ working according to the following protocol:

FOR t = 1, 2, ...
    (1) $\mathfrak{A}$ chooses a prediction $\gamma_t \in \Gamma$
    (2) $\mathfrak{A}$ observes the actual outcome $\omega_t \in \Omega$
    (3) $\mathfrak{A}$ suffers loss $\lambda(\omega_t, \gamma_t)$
END FOR.

Over the first $T$ trials, $\mathfrak{A}$ suffers the total loss

$$\mathrm{Loss}_{\mathfrak{A}}(\omega_1, \omega_2, \ldots, \omega_T) = \sum_{t=1}^{T} \lambda(\omega_t, \gamma_t).$$

By definition, put $\mathrm{Loss}_{\mathfrak{A}}(\Lambda) = 0$, where $\Lambda$ denotes the empty string.

The function $\mathrm{Loss}_{\mathfrak{A}}(x)$ can be treated as the predictive complexity of $x$ in the game $\mathfrak{G}$ w.r.t. $\mathfrak{A}$. We will call these functions loss processes. Sadly, the set of loss processes has no universal elements except in degenerate cases. Let us now proceed to defining a universal complexity measure.
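The protocol and the resulting loss process can be sketched as follows. This is an illustration only: the logarithmic loss is used as the example game, and `laplace_strategy` (Laplace's rule of succession) is a hypothetical prediction algorithm chosen for the demonstration.

```python
import math

def log_loss(outcome, gamma):
    """Logarithmic loss, base 2; +inf if the prediction excludes the outcome."""
    q = gamma if outcome == 1 else 1.0 - gamma
    return math.inf if q <= 0.0 else -math.log2(q)

def laplace_strategy(history):
    """Hypothetical prediction algorithm: predicted probability of a 1 is
    (number of 1s seen + 1) / (number of outcomes seen + 2)."""
    return (sum(history) + 1) / (len(history) + 2)

def total_loss(strategy, outcomes, loss=log_loss):
    """Run the prediction protocol and return the cumulative loss; the
    empty string gets loss 0."""
    total = 0.0
    history = []
    for omega in outcomes:
        gamma = strategy(history)     # step (1): choose a prediction
        total += loss(omega, gamma)   # steps (2)-(3): observe outcome, suffer loss
        history.append(omega)
    return total

print(total_loss(laplace_strategy, [1, 1, 1]))
print(total_loss(laplace_strategy, [0, 1, 0, 1]))
```

Swapping in another loss function only requires passing a different `loss` argument, for instance `lambda omega, gamma: (omega - gamma) ** 2` for the square loss.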



2.2 Predictive Complexity 

Let us fix a game $\mathfrak{G}$. A function $L : \Omega^* \to \mathbb{R} \cup \{+\infty\}$ is called a superloss process w.r.t. $\mathfrak{G}$ (see i) if the following conditions hold:

- $L(\Lambda) = 0$,

- for every $x \in \Omega^*$, the pair $(L(x0) - L(x), L(x1) - L(x))$ is a superprediction w.r.t. $\mathfrak{G}$, and

- $L$ is semicomputable from above.

We will say that a superloss process $K$ is universal if for any superloss process $L$ there exists a constant $C$ such that $\forall x \in \Omega^* : K(x) \le L(x) + C$. The difference between two universal superloss processes w.r.t. $\mathfrak{G}$ is bounded by a constant. If universal superloss processes w.r.t. $\mathfrak{G}$ exist we may pick one and denote it by $\mathcal{K}^{\mathfrak{G}}$. It follows from the definition that, for every prediction algorithm $\mathfrak{A}$, there is a constant $C$ such that for every $x$ we have $\mathcal{K}^{\mathfrak{G}}(x) \le \mathrm{Loss}^{\mathfrak{G}}_{\mathfrak{A}}(x) + C$, where $\mathrm{Loss}^{\mathfrak{G}}$ denotes the loss w.r.t. $\mathfrak{G}$. One may call $\mathcal{K}^{\mathfrak{G}}$ (predictive) complexity w.r.t. $\mathfrak{G}$.

3 Expectations of Complexity 

Theorem 1. Let $\mathfrak{G}$ be a game satisfying the conditions 1-4. Let $\mathfrak{G}$ specify complexity $\mathcal{K}$ and let $S$ be the set of superpredictions for $\mathfrak{G}$. Then for every $p \in (0, 1)$

(i) there exists a finite limit

$$\hat{f}(p) = \lim_{n \to \infty} \frac{\mathbf{E}\mathcal{K}(\xi_1^{(p)} \ldots \xi_n^{(p)})}{n}, \qquad (2)$$

where $\xi_1^{(p)}, \ldots, \xi_n^{(p)}$ are results of $n$ independent Bernoulli trials with the probability of 1 being equal to $p$, and

(ii) the equality

$$\hat{f}(p) = -p f^*\left(-\frac{1 - p}{p}\right) \qquad (3)$$

holds, where $f^*$ is the function conjugated to $f$ specified by $f(x) = \inf\{y \mid (x, y) \in S\}$ for every $x \in \mathbb{R}$.

Proof. In order to apply the Legendre transformation, we should make sure that $f$ is convex. This fact is implied by the following lemma.

Lemma 2. If a game $\mathfrak{G}$ satisfying conditions 1-4 specifies predictive complexity, then the intersection of its set of superpredictions $S$ and $\mathbb{R}^2$ is convex.

The proof is in Appendix B. 

The following proposition allows us to estimate the expectations. 

Proposition 1 ([2]). Let $\mathfrak{G}$ be a game with the set of superpredictions $S$. Suppose that $p \in (0, 1)$, the game $\mathfrak{G}$ specifies complexity $\mathcal{K}$, and the numbers $\rho_1 < \rho_2$ are such that

$$\forall (x, y) \in S : (1 - p)x + py \ge \rho_1, \qquad (4)$$

but

$$\exists (x_0, y_0) \in S \cap \mathbb{R}^2 : (1 - p)x_0 + p y_0 \le \rho_2. \qquad (5)$$

Then there is $C > 0$ such that, for every $n \in \mathbb{N}$, we have

$$\rho_1 n \le \mathbf{E}\mathcal{K}(\xi_1^{(p)} \ldots \xi_n^{(p)}) \le \rho_2 n + C, \qquad (6)$$

where $\xi_1^{(p)}, \ldots, \xi_n^{(p)}$ are results of $n$ independent Bernoulli trials with the probability of 1 being equal to $p$.

It follows from Proposition 1 that there is $C > 0$ such that, for every $p \in (0, 1)$ and every positive integer $n$, we have

$$a(p) n \le \mathbf{E}\mathcal{K}(\xi_1^{(p)} \ldots \xi_n^{(p)}) \le a(p) n + C, \qquad (7)$$

where

$$a(p) = \inf_{(x, y) \in S} [(1 - p)x + py] \qquad (8)$$

$$= \inf_{x \in \mathbb{R}} [(1 - p)x + p f(x)] \qquad (9)$$

$$= -p \sup_{x \in \mathbb{R}} \left[-\frac{1 - p}{p} x - f(x)\right] \qquad (10)$$

$$= -p f^*\left(-\frac{1 - p}{p}\right). \qquad (11)$$

¹ We (by definition) assume that $\inf \emptyset = +\infty$.
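The chain (8)-(11) can be checked numerically on a concrete game. The sketch below assumes the logarithmic game with base-2 logarithms, for which $f(x) = -\log(1 - 2^{-x})$ for $x > 0$ and $f(x) = +\infty$ otherwise, and for which the infimum in (8) equals the binary entropy of $p$. The supremum in the conjugate is approximated on a finite grid, so the result is only approximate; all names are chosen for this example.

```python
import math

def f(x):
    """f(x) = inf{y : (x, y) in S} for the logarithmic game."""
    return math.inf if x <= 0 else -math.log2(1.0 - 2.0 ** (-x))

def conjugate(t, xs):
    """Grid approximation of f*(t) = sup_x (x * t - f(x))."""
    return max(x * t - f(x) for x in xs)

def a_via_conjugate(p, xs):
    """Right-hand side of (11): -p * f*(-(1 - p) / p)."""
    return -p * conjugate(-(1.0 - p) / p, xs)

def entropy(p):
    """Binary entropy in bits, the value of the infimum (8) for this game."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Grid over (0, 30]; the step size is an arbitrary choice of this sketch.
xs = [k * 5e-4 for k in range(1, 60001)]
for p in (0.1, 0.3, 0.5, 0.9):
    print(p, a_via_conjugate(p, xs), entropy(p))
```

For each $p$ the two printed values agree up to the discretisation error of the grid.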



Corollary 1. Let two games $\mathfrak{G}_1$ and $\mathfrak{G}_2$ satisfy conditions 1-4. Suppose they have the sets of superpredictions $S_1$ and $S_2$ and specify complexities $\mathcal{K}^1$ and $\mathcal{K}^2$. If there is a function $\delta(n) = o(n)$ as $n \to \infty$ such that for every $x \in \mathbb{B}^*$ the inequality

$$|\mathcal{K}^1(x) - \mathcal{K}^2(x)| \le \delta(|x|) \qquad (12)$$

holds, then $S_1 = S_2$ and complexities $\mathcal{K}^1$ and $\mathcal{K}^2$ are equal up to a constant.

Proof. For every $p \in (0, 1)$ we have

$$\left| \frac{\mathbf{E}\mathcal{K}^1(\xi_1^{(p)} \ldots \xi_n^{(p)})}{n} - \frac{\mathbf{E}\mathcal{K}^2(\xi_1^{(p)} \ldots \xi_n^{(p)})}{n} \right| \le \frac{\delta(n)}{n} = o(1) \qquad (13)$$

as $n \to \infty$, where $\xi_1^{(p)}, \ldots, \xi_n^{(p)}$ are as above. This implies that for every $p \in (0, 1)$ the equality $\hat{f}_1(p) = \hat{f}_2(p)$ holds, where $\hat{f}_1$ and $\hat{f}_2$ are defined for the games $\mathfrak{G}_1$ and $\mathfrak{G}_2$ by (2). Thus $f_1^*(t) = f_2^*(t)$ for all $t \in (-\infty, 0)$, where $f_1$ and $f_2$ are defined in the same way as $f$ in (ii) of Theorem 1. We have $f_1^*(0) = f_2^*(0)$ by continuity; for every $t > 0$ the equality $f_1^*(t) = f_2^*(t) = +\infty$ holds. It follows from a fundamental property of conjugated functions, namely, $f^{**} = f$ (Proposition 2 from Appendix A), that the functions $f_1$ and $f_2$ coincide. This implies that $S_1 = S_2$. □

Corollary 2. There is no game specifying plain Kolmogorov complexity K, pre- 
fix complexity KP, or monotone complexity Km as its predictive complexity. 



Proof. The difference between any of these functions and the negative logarithm of Levin’s a priori semimeasure is bounded by a term of logarithmic order of the length of a string. If one of the functions had been predictive complexity for a game, this game would have been equivalent to the logarithmic game, which has

$$\lambda(\omega, \gamma) = \begin{cases} -\log(1 - \gamma) & \text{if } \omega = 0 \\ -\log \gamma & \text{if } \omega = 1 \end{cases} \qquad (14)$$

$\gamma \in [0, 1]$, and complexity would have coincided with logarithmic complexity $\mathcal{K}^{\log}$. But $\mathcal{K}^{\log}$ coincides with the negative logarithm of Levin’s a priori semimeasure (see [S]). However none of the differences between these functions and KM can be bounded by a constant (see |9|d| ). □

Appendix A: Legendre Transformation

The Legendre(-Young-Fenchel) transformation may be defined for functionals on a locally convex space. However everything we need in this paper is just the simplest one-dimensional case. We will follow the treatment of the one-dimensional case in [Ij; the general theory of this transformation and conjugated functions may be found in m.

Consider a convex function $f : \mathbb{R} \to [-\infty, +\infty]$. The conjugated function $f^* : \mathbb{R} \to [-\infty, +\infty]$ is defined by

$$f^*(t) = \sup_{x \in \mathbb{R}} (xt - f(x)). \qquad (15)$$

A function $g : \mathbb{R} \to [-\infty, +\infty]$ is called proper if $\forall x \in \mathbb{R} : g(x) > -\infty$ and $\exists x \in \mathbb{R} : g(x) < +\infty$. A proper $g$ is closed if for each real $\alpha$ the level set $L_\alpha = \{x \in \mathbb{R} \mid g(x) \le \alpha\}$ is closed w.r.t. the standard topology of $\mathbb{R}$.

Fig. 1. Evaluation of the Legendre transformation.

Figure 1 provides an example. In the picture we have

$$f(x) = \begin{cases} \ldots \\ +\infty & \text{otherwise} \end{cases}$$

and we evaluate $f^*(-1/2)$. The supremum from (15) is achieved at $x = \sqrt{2}$.

Proposition 2 (see |H|SJ). If $f : \mathbb{R} \to [-\infty, +\infty]$ is a proper convex function, the following properties hold:

(i) $f^*$ is convex, proper and closed, and
(ii) if $f$ is closed, $f^{**} = f$.


Appendix B: On a Necessary Condition for the Existence 
of Predictive Complexity 

Proof (of Lemma 2). Assume the converse. Consider a game $\mathfrak{G}$ with the set of superpredictions $S$ such that $S \cap \mathbb{R}^2$ is not convex but there exists complexity $\mathcal{K}$ w.r.t. $\mathfrak{G}$.

There exist points $B_0, B_1 \in S$ such that the segment $[B_0, B_1]$ is not a subset of $S$. Without loss of generality we may assume that $B_0 = (b_0, 0)$, $B_1 = (0, b_1)$ (see Fig. 2). Indeed, a game $\mathfrak{G}$ with the set of superpredictions $S$ specifies complexity if and only if a game $\mathfrak{G}'$ with the set of superpredictions $S'$ which is a shift of $S$ (i.e. there are $a, b \in \mathbb{R}$ such that $S' = \{(x', y') \in (-\infty, +\infty]^2 \mid \exists (x, y) \in S : x' = x + a, y' = y + b\}$) specifies complexity.

There exists a point $A = (a_0, a_1)$ with $a_0, a_1 > 0$ on the boundary of $S$ and above the straight line passing through $B_0$ and $B_1$. Let us denote this line by $l$ and let us assume that it has the equation $\alpha_0 x + \alpha_1 y = \rho$, where $\alpha_0, \alpha_1, \rho > 0$.

Let us denote the numbers of 1s and 0s in a string $x$ by $\sharp_1 x$ and $\sharp_0 x$, respectively. Since $b_0 \sharp_0 x$ and $b_1 \sharp_1 x$ are superloss processes, there is $C > 0$ such that, for every $x \in \mathbb{B}^*$, the inequalities

$$\mathcal{K}(x) \le b_0 \sharp_0 x + C \qquad (16)$$

$$\mathcal{K}(x) \le b_1 \sharp_1 x + C \qquad (17)$$

hold. At the same time, there is a sequence of strings $x_1, x_2, \ldots$ such that for any $n \in \mathbb{N}$ we have $|x_n| = n$ and

$$\mathcal{K}(x_n) \ge a_0 \sharp_0 x_n + a_1 \sharp_1 x_n. \qquad (18)$$

The construction of $x_n$ is by induction. Let $x_0 = \Lambda$. Suppose we have constructed $x_n$. The point $(\mathcal{K}(x_n 0) - \mathcal{K}(x_n), \mathcal{K}(x_n 1) - \mathcal{K}(x_n))$ should lie in at least one of the half-planes $\{(x, y) \mid x \ge a_0\}$ or $\{(x, y) \mid y \ge a_1\}$, i.e. at least one of the inequalities

$$\mathcal{K}(x_n 0) - \mathcal{K}(x_n) \ge a_0 \qquad (19)$$

$$\mathcal{K}(x_n 1) - \mathcal{K}(x_n) \ge a_1 \qquad (20)$$

holds. We define $x_{n+1}$ to be either $x_n 0$ or $x_n 1$ accordingly.

Fig. 2. The set S is coloured grey.

Combining (16), (17) and (18) we get

$$a_0 \sharp_0 x_n + a_1 \sharp_1 x_n \le b_0 \sharp_0 x_n + C \qquad (21)$$

$$a_0 \sharp_0 x_n + a_1 \sharp_1 x_n \le b_1 \sharp_1 x_n + C \qquad (22)$$
for every $n \in \mathbb{N}$. Since $(b_0, 0)$ and $(0, b_1)$ lie below $l'$, where $l'$ is parallel to $l$ and passes through $A$, we have

$$b_0 = \frac{\alpha_0 a_0 + \alpha_1 a_1}{\alpha_0} - \delta_0 \qquad (23)$$

$$b_1 = \frac{\alpha_0 a_0 + \alpha_1 a_1}{\alpha_1} - \delta_1 \qquad (24)$$

where $\delta_0, \delta_1 > 0$. If we multiply (21) by $\alpha_0 / a_1$, (22) by $\alpha_1 / a_0$ and then add them together, we obtain

$$\frac{\alpha_0 \delta_0}{a_1} \sharp_0 x_n + \frac{\alpha_1 \delta_1}{a_0} \sharp_1 x_n \le C_1,$$

where $C_1 > 0$ is a constant. This is a contradiction since $\alpha_0 \delta_0 / a_1 > 0$, $\alpha_1 \delta_1 / a_0 > 0$, and at least one of the values $\sharp_0 x_n$, $\sharp_1 x_n$ is unbounded. □

Abstract. Predictive complexity is a generalization of Kolmogorov
complexity. It corresponds to an “optimal” prediction strategy which 
gives a lower bound to ability of any algorithm to predict elements of a 
sequence of outcomes. A variety of types of loss functions makes it inter- 
esting to study relations between corresponding predictive complexities. 
Non-linear inequalities (with variable coefficients) between predictive 
complexity KG{x) of non-logarithmic type and Kolmogorov complex- 
ity K{x) (which is close to predictive complexity for logarithmic loss 
function) are the main subject of consideration in this paper. We deduce 
from these inequalities an asymptotic relation sup ~ 

x:l{x) = n 

when n — > oo, where a is a constant and l{x) is the length of a sequence x. 
An analogous asymptotic result holds for relative complexities K{x)/l{x) 
and KG{x) /l(x). To obtain these inequalities we present estimates of the 
cardinality of all sequences of given predictive complexity. 



1 Introduction 



A central problem considered in machine learning (and statistics) is the problem of predicting a future event $x_i$ based on past observations $x_1 x_2 \ldots x_{i-1}$, where $i = 1, 2, \ldots$. The simplest case is when $x_i$ is either 0 or 1. A prediction algorithm makes its prediction on-line in the form of a real number $p_i$ between 0 and 1. We suppose that the quality of prediction is measured by a specific loss function $\lambda(x_i, p_i)$. The total loss of prediction suffered on a sequence of events $x_1 x_2 \ldots x_n$ is measured by the sum of all values $\lambda(x_i, p_i)$, $i = 1, \ldots, n$.

Various loss functions are considered in the literature on machine learning and prediction with expert advice (see, for example, [11218110112] ). The most important of them are the logarithmic loss function and the square-loss function. The logarithmic loss function, $\lambda(\sigma, p) = -\log p$ if $\sigma = 1$ and $\lambda(\sigma, p) = -\log(1 - p)$ otherwise, leads to the log-likelihood function. The square-loss function $\lambda(\sigma, \gamma) = (\sigma - \gamma)^2$ is important to applications.

The main goal of prediction is to find a method of prediction which minimizes the total loss suffered on a sequence $x_1 x_2 \ldots x_i$ for $i = 1, 2, \ldots$. This “minimal” possible total loss of prediction was formalized by Vovk [10] in a notion of predictive complexity. The corresponding method of prediction gives a lower bound to the ability of any algorithm to predict elements of a sequence of outcomes. Predictive complexity is a generalization of the notion of Kolmogorov complexity and has analogous asymptotic properties. In the case of the logarithmic loss function, predictive complexity coincides with a variant of Kolmogorov complexity [4]. Predictive complexity corresponding to the square-loss function gives a lower limit to the quality of regression under square loss.

A variety of types of loss functions defines the problem of comparative study
of the corresponding predictive complexities. By comparing predictive complexities
corresponding to different loss functions, we compare learnability of strings under
different learning environments. We continue the investigation initiated by
Kalnishkan in [1]. Paper [1] provided necessary and sufficient conditions on constant
coefficients $a_1, a_2$ and $b_1, b_2$, under which the inequalities

$$a_1 K^1(x) + a_2 l(x) + c_1 \ge K^2(x) \quad\text{and}\quad b_1 K^1(x) + b_2 K^2(x) \le b_3 l(x) + c_2$$

hold for some additive constants $c_1, c_2$. Here $K^1(x)$ and $K^2(x)$ are predictive
complexities of different types, and $l(x)$ is the length of a sequence $x$. Logarithmic
$KG^{\log}$ and square-loss $KG^{sq}$ complexities can be among $K^1$ and $K^2$; in particular,
the inequality $KG^{sq}(x) \le \frac{1}{4} KG^{\log}(x) + c$ holds for some positive constant $c$.
Converse inequalities with constant coefficients between these complexities which
can be obtained by Kalnishkan’s method have an additive term of order $O(l(x))$. To
avoid these addends we explore non-linear inequalities. These inequalities hold
up to factors $O(\log l(x))$ and present relations between the corresponding complexities
more exactly.

By its definition below, $KG^{\log}(x)$ coincides with the minus logarithm of
Levin’s “a priori” semimeasure (see also [5]), which is close to Kolmogorov
complexity $K(x)$ up to an addend $O(\log l(x))$. For this reason, and because of the general
fundamental importance of Kolmogorov complexity, we compare $KG^{sq}(x)$ with $K(x)$.

To obtain these inequalities we estimate the number of all sequences of length
$n$ with a given upper bound $k$ on predictive complexity (Proposition 5). We deduce
from this combinatorial estimate non-linear inequalities between Kolmogorov
complexity and predictive complexity of non-logarithmic type (Propositions 6
and 7). More advanced estimates for predictive complexity are given in Theorems 9
and 10.

The main results of this paper are formulated in asymptotic form in Theorems
3 and 4. We compare Kolmogorov complexity $K(x)$ and a predictive complexity
$KG(x)$ of non-logarithmic type. Theorem 3 asserts that

$$\sup_{x : l(x) = n} \frac{K(x)}{KG(x)} \sim \frac{1}{a} \log n,$$

where $a$ is a constant (the relation $f(n) \sim g(n)$ means that
$\lim_{n \to \infty} f(n)/g(n) = 1$).

Theorem 4 gives an analogous relation between relative complexities $K(x)/l(x)$
and $KG(x)/l(x)$.



2 Predictive Complexity



We consider only the simplest case, where events $x_1, x_2, \dots, x_i, \dots$ are simple binary
outcomes from $\{0, 1\}$; nevertheless, our results can trivially be extended to the
case of an arbitrary finite set of possible outcomes $\{0, 1, \dots, L-1\}$, where $L > 1$.
It is natural to suppose that all predictions are given according to a prediction
strategy (or prediction algorithm) $p_i = S(x_1, x_2, \dots, x_{i-1})$. We will suppose also
that our loss functions are computable, and hence they are continuous in $p$ in
the interval $[0, 1]$. The total loss incurred by Predictor who follows the strategy
$S$ over the first $n$ trials is defined as

$$\mathrm{Loss}_S(x_1 x_2 \dots x_n) = \sum_{i=1}^{n} \lambda(x_i, S(x_1, x_2, \dots, x_{i-1})).$$



The main problem is to find a method of prediction $S$ which minimizes the total
loss $\mathrm{Loss}_S(x)$ suffered on a sequence $x$ of outcomes. In machine learning theory
several “aggregating algorithms” achieving this goal in the case of a finite number
of experts were developed |7llll2l2fR] . 

Vovk [8,10] proposed a condition that is sufficient for the optimal efficiency of his
aggregating algorithm AA. This condition is concavity of the exponent of
the loss function considered. More precisely, we fix the learning rate $\eta > 0$ and put
$\beta = e^{-\eta} \in (0, 1)$. A loss function $\lambda(\sigma, p)$ is called $\eta$-mixable if for any sequence
of predictions $\gamma_1, \gamma_2, \dots$ and for any sequence of weights $p_1, p_2, \dots$ with sum $\le 1$
a prediction $\gamma$ exists such that

$$\lambda(\sigma, \gamma) \le \log_\beta \sum_i p_i \beta^{\lambda(\sigma, \gamma_i)} \tag{1}$$

for all $\sigma$. By [8] the log-loss function is $\eta$-mixable for any $0 < \eta \le 1$, and the square
difference is also $\eta$-mixable for any $0 < \eta \le 2$.
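
Inequality (1) can be checked numerically for the square loss. The sketch below assumes $\eta = 2$ and picks $\gamma$ by the substitution rule $\gamma = \frac12 + \frac{g(0) - g(1)}{2}$, clipped to $[0,1]$, where $g(\sigma)$ is the right-hand side of (1). This rule is a standard choice for the binary square-loss game and is an assumption here, not something stated in the text; all names are illustrative.

```python
import math
import random

ETA = 2.0                 # learning rate (assumed value for the square loss)
BETA = math.exp(-ETA)     # beta = e^{-eta}


def square_loss(outcome: int, gamma: float) -> float:
    return (outcome - gamma) ** 2


def generalized_prediction(outcome, preds, weights) -> float:
    """Right-hand side of (1): log_beta of sum_i p_i * beta**loss(outcome, gamma_i)."""
    s = sum(w * BETA ** square_loss(outcome, g) for g, w in zip(preds, weights))
    return math.log(s) / math.log(BETA)


def substitute(preds, weights) -> float:
    """Pick a single prediction gamma intended to satisfy (1) for both outcomes."""
    g0 = generalized_prediction(0, preds, weights)
    g1 = generalized_prediction(1, preds, weights)
    gamma = 0.5 + (g0 - g1) / 2.0
    return min(1.0, max(0.0, gamma))


def check_mixability(trials: int = 2000, seed: int = 0) -> bool:
    """Test (1) on random finite pools of predictions with normalized weights."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = rng.randint(1, 6)
        preds = [rng.random() for _ in range(m)]
        raw = [rng.random() + 1e-9 for _ in range(m)]
        weights = [r / sum(raw) for r in raw]
        gamma = substitute(preds, weights)
        for outcome in (0, 1):
            bound = generalized_prediction(outcome, preds, weights)
            if square_loss(outcome, gamma) > bound + 1e-9:
                return False
    return True


ok = check_mixability()
```

A random search of this kind cannot prove mixability; it only shows what the inequality asserts for concrete pools of predictions.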

In [10] Vovk extended his AA to an infinite pool of all “computationally efficient”
experts. He introduced a notion of predictive complexity, which is a generalization
of the notion of Kolmogorov complexity. A function $KG(x)$ is a measure of
predictive complexity if the following two conditions hold:

1. $KG(\Lambda) = 0$ (where $\Lambda$ is the empty sequence) and for every $x$ there exists a
$p$ such that for each $\sigma$, $KG(x\sigma) \ge KG(x) + \lambda(\sigma, p)$;

2. $KG(x)$ is semicomputable from above, which means that there exists a
computable sequence of simple functions $KG^t(x)$ such that, for each $x$,
$KG(x) = \inf_t KG^t(x)$.

By a simple function we mean a nonnegative function which takes rational values
or $+\infty$ and equals $+\infty$ for almost all $x \in \Xi$.

Requirement 1) means that the measure of predictive complexity must be
valid: there must exist a prediction strategy that achieves it. We consider the
universal prediction strategy $\Lambda(x) = p$ (which is possibly uncomputable), where
$p = p(x)$ is the prediction from item 1) of the definition of a measure of
predictive complexity. By definition $\mathrm{Loss}_\Lambda(x) \le KG(x)$ for each $x$. Notice that
if $\ge$ in 1) is replaced by $=$, the definition of a total loss function will be obtained.
Requirement 2) means that $KG(x)$ must be “computable in the limit”.

The main advantage of this definition is that a semicomputable from above
sequence $KG_i(x)$ of all measures of predictive complexity exists. This means
that there exists a sequence of simple functions $KG_i^t(x)$, computable from $i$ and $t$,
such that

1. $KG_i^{t+1}(x) \le KG_i^t(x)$ for all $i, t, x$;

2. $KG_i(x) = \inf_t KG_i^t(x)$ for all $i, x$;

3. for each measure of predictive complexity $KG(x)$ there exists an $i$ such that
$KG(x) = KG_i(x)$ for all $x$.

We call $i$ an enumeration program of $KG_i(x)$ (for details see the Appendix). In
particular, for any computable prediction strategy $S$ an enumeration program $i$
exists such that $\mathrm{Loss}_S(x) = KG_i(x)$ for each finite sequence $x$. We can refine this
as follows. We fix some universal programming language. Let $K(S)$ be the length
of the shortest program computing, for any $x$, a rational approximation of $S(x)$
with a given degree of precision. Evidently, there exists a computable function $f(p)$
which transforms any program $p$ computing $S$ into an enumerating program $i =
f(p)$ such that $\mathrm{Loss}_S(x) = KG_i(x)$. We have also $K(i) = K(f(p)) \le K(p) + c$,
where $c$ is a constant.

Let us mention some analogy with Kolmogorov complexity. In the theory
of Kolmogorov complexity computable methods of decoding of finite binary
sequences are considered. By such a method $F$ we can reconstruct any finite sequence
$x$ using its binary program $p$: $x = F(p)$. Each method of decoding $F$ defines some
measure of complexity $K_F(x) = \min\{l(p) : F(p) = x\}$ of finite sequences $x$. It
is easy to verify that this function is semicomputable from above. Kolmogorov’s
idea was to “mix” all these measures of complexity in one “universal” measure.
A computable sequence $F_i$ of all methods of decoding can be constructed by the
methods of the theory of algorithms (see Rogers). A universal method of decoding can
be defined by $U(\langle i, p \rangle) = F_i(p)$, where $i$ is a program computing $F_i$, and $\langle i, p \rangle$
is a suitable code of a pair of programs. Then for any semicomputable from
above method of decoding $F$ it holds that $K_U(x) \le K_F(x) + O(1)$ for each $x$, where the
constant $O(1)$ depends on $F$. We fix some $K_U(x)$, denote it $K(x)$, and call it the
Kolmogorov complexity of a finite sequence $x$. For technical reasons it is convenient
to consider prefix-free methods of decoding: if $F(p)$ and $F(q)$ are defined and
distinct then the codes $p$ and $q$ are incompatible. From this it follows that $\sum_x 2^{-K(x)} \le 1$.

In this case the prefix Kolmogorov complexity can also be defined analytically 
(see ig) 

$$K_U(x) = \log_{1/2} \sum_{i=1}^{\infty} r_i 2^{-K_{F_i}(x)},$$

where $r_1, r_2, \dots$ is a computable sequence of nonnegative weights with sum $\le 1$.
For example, we can take $r_i = 2^{-i}$, $i = 1, 2, \dots$.

Analogously, the mixture of all measures of predictive complexity $KG_i(x)$ in
the case of an $\eta$-mixable loss function is defined as

$$KG(x) = \log_\beta \sum_{i=1}^{\infty} r_i \beta^{KG_i(x)}, \tag{2}$$

where $r_i = 2^{-K(i)}$. We check that $KG(x)$ is a measure of predictive complexity
in the Appendix. The following proposition (which is an easy consequence of
formula (2)) shows that the function $KG(x)$ defined by (2) is a measure of
predictive complexity minimal up to an additive constant.

Proposition 1. [10] Let a loss function $\lambda(\omega, p)$ be computable and $\eta$-mixable
for some $\eta > 0$. Then there exists a measure of predictive complexity $KG(x)$
such that for any measure of predictive complexity $KG_i(x)$

$$KG(x) \le KG_i(x) + (\ln 2/\eta) K(i)$$

for all $x$; besides, a constant $c$ exists such that

$$KG(x) \le \mathrm{Loss}_S(x) + (\ln 2/\eta)(K(S) + c)$$

for each computable prediction strategy $S$ and each $x$.
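
The first inequality of Proposition 1 can be illustrated on a finite pool. The sketch below evaluates the mixture (2) for a few hypothetical cumulative losses $L_i$ (standing in for $KG_i(x)$) with weights $r_i = 2^{-i}$ (standing in for $2^{-K(i)}$), and checks that the mixture exceeds each $L_i$ by at most $(1/\eta)\ln(1/r_i)$, which equals $(\ln 2/\eta)\, i$ here. The numbers and names are assumptions chosen for illustration.

```python
import math


def mixture_complexity(losses, weights, eta: float) -> float:
    """log_beta(sum_i r_i * beta**L_i) with beta = exp(-eta), as in formula (2)."""
    s = sum(r * math.exp(-eta * L) for L, r in zip(losses, weights))
    return -math.log(s) / eta


eta = 2.0
losses = [3.0, 1.5, 7.25]                                 # hypothetical values of KG_i(x)
weights = [2.0 ** -(i + 1) for i in range(len(losses))]   # r_i = 2^{-i}, i = 1, 2, 3

mix = mixture_complexity(losses, weights, eta)

# Per-expert bounds L_i + (1/eta) * ln(1/r_i) = L_i + (ln 2 / eta) * i.
bounds = [L + math.log(1.0 / r) / eta for L, r in zip(losses, weights)]
```

The bound holds because the sum inside the logarithm is at least any single term $r_i \beta^{L_i}$.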

Let some $\eta$-mixable loss function be given. We fix some $KG(x)$ satisfying the conditions
of Proposition 1 and call its value the predictive complexity of $x$.

The inequality

$$KG(x) \le (\ln 2/\eta)(K(x) + c)$$

between the complexities $KG(x)$ and $K(x)$ can be obtained from Proposition 1,
where $c$ is a positive constant depending on $\eta$. To prove it, consider the prediction
strategy $S$ defined by $x$ such that $S(z) = x_i$ for each $z$ of length $i - 1$, where
$1 \le i \le l(x) - 1$, and $S(z) = 0$ otherwise (here we also used requirement 2)
from Section 3).



3 Bounded Loss Functions 

We prove our results for a wide class of bounded loss functions. A typical
representative of this class is the square-loss function. We impose the following
restrictions on a loss function $\lambda(\sigma, p)$:

1. $b = \inf_p \sup_\sigma \lambda(\sigma, p) > 0$;

2. $\lambda(0, 0) = \lambda(1, 1) = 0$;

3. the loss function $\lambda(\sigma, p)$ is $\eta$-mixable for some $\eta > 0$;

4. $\lambda(\sigma, p)$ is strictly monotonic in $p$, i.e. $\lambda(0, p) > \lambda(0, p')$ and $\lambda(1, p) < \lambda(1, p')$ if
$p > p'$ for all $p, p' \in [0, 1]$ (the log-loss function and the squared difference satisfy
conditions 1)-4) with $b = 1$ and $b = \frac{1}{4}$, accordingly);

5. the loss function $\lambda(\sigma, p)$ is bounded (the square-loss function is bounded, but
the log-loss function fails this condition).

Denote $a = \lambda(1, 0)$ and $a' = \lambda(0, 1)$. In the following we suppose without loss of
generality that $0 < a \le a'$. We will use the following technical proposition.

Proposition 2. Let a loss function $\lambda(\sigma, p)$ satisfy conditions 1)-5). Then there
exists a computable monotonically increasing function $\delta(\epsilon)$ such that $\delta(\epsilon) > 0$ if
$\epsilon > 0$ and such that for each $0 < \epsilon < 1$ and $0 \le p \le 1$, if $\lambda(0, p) \le a'(1 - \epsilon)$ and
$\lambda(1, p) \le a(1 - \epsilon)$ then $\lambda(0, p) \ge a\delta(\epsilon)$ and $\lambda(1, p) \ge a\delta(\epsilon)$.

The proof of this proposition is straightforward. In the following it is sufficient to
use this proposition instead of 4) and 5).

By definition $a \ge b \ge a\delta(\epsilon)$ for all $0 < \epsilon < 1$.

The square-loss function normalized by $a > 0$, $\lambda(\sigma, p) = a(\sigma - p)^2$, satisfies
these conditions with $a' = a$ and $\delta(\epsilon) = \epsilon^2/4$.
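
The claim $\delta(\epsilon) = \epsilon^2/4$ for the normalized square loss can be checked on a grid. The sketch below tests the implication of Proposition 2 for $\lambda(\sigma, p) = a(\sigma - p)^2$; the grid resolution, tolerance, and function names are assumptions made for illustration.

```python
def normalized_square_loss(outcome: int, p: float, a: float) -> float:
    return a * (outcome - p) ** 2


def delta(eps: float) -> float:
    """Candidate function from the text: delta(eps) = eps^2 / 4."""
    return eps * eps / 4.0


def check_proposition_2(a: float = 1.0, grid: int = 1000) -> bool:
    """If both losses are <= a(1 - eps), then both must be >= a * delta(eps)."""
    tol = 1e-12
    for e_idx in range(1, 100):
        eps = e_idx / 100.0
        for j in range(grid + 1):
            p = j / grid
            l0 = normalized_square_loss(0, p, a)
            l1 = normalized_square_loss(1, p, a)
            if l0 <= a * (1.0 - eps) and l1 <= a * (1.0 - eps):
                if l0 < a * delta(eps) - tol or l1 < a * delta(eps) - tol:
                    return False
    return True


holds_unit = check_proposition_2(a=1.0)
holds_scaled = check_proposition_2(a=3.0)
```

The reason behind the check: $a p^2 \le a(1-\epsilon)$ gives $p \le \sqrt{1-\epsilon} \le 1 - \epsilon/2$, hence $(1-p)^2 \ge \epsilon^2/4$, and the case of the other outcome is symmetric.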



4 Summary of Results 



In this section we summarize the main results in an asymptotic form. These results
follow from the results of the next section.

Let $\lambda(\sigma, p)$ be a loss function satisfying restrictions 1)-5) (restrictions 4)
and 5) can be replaced by the condition of Proposition 2), and let $KG(x)$ be the
corresponding predictive complexity. We call $\lambda(\sigma, p)$ the bounded loss function.

Let us define a worst-case ratio function

$$f(n) = \sup_{x : l(x) = n} \frac{K(x)}{KG(x)}. \tag{3}$$

The next theorem follows directly from Theorem 9 (below).

Theorem 3. The worst-case ratio function $f(n)$ defined by (3) satisfies

$$\lim_{n \to \infty} \frac{f(n)}{\frac{1}{a} \log n} = 1.$$

This theorem estimates the deviation in the worst case between two complexities
on all sequences of length $n$. The following theorem shows that an analogous
deviation takes place for relative complexities $K(x)/l(x)$ and $KG(x)/l(x)$. Let

$$h_n(t) = \sup_{x \in B_{n,t}} \frac{K(x)}{n}, \tag{4}$$

where

$$B_{n,t} = \left\{ x \mid l(x) = n, \ \frac{KG(x)}{n} \le t \right\}. \tag{5}$$

Define the relative complexities comparing functions

$$\underline{h}(t) = \liminf_{n \to \infty} h_n(t), \tag{6}$$

$$\overline{h}(t) = \limsup_{n \to \infty} h_n(t). \tag{7}$$

The following theorem is a direct corollary of Theorem 10 (below).

Theorem 4. The relative complexities comparing functions $\underline{h}(t)$ and $\overline{h}(t)$
defined by (6) and (7) satisfy



lim 

t^o 





= 1 . 



5 Non-linear Inequalities



In this section we explore some possible connections between Kolmogorov
complexity $K(x)$ and predictive complexity $KG(x)$.

A very natural problem arises: to estimate the cardinality of all sequences of
predictive complexity less than $k$. A trivial property of Kolmogorov complexity
and of predictive complexity for the log-loss function is that the cardinality of all binary
sequences $x$ of complexity less than $k$ is bigger than $2^{k-c}$ and less than $2^k$ for
some positive constant $c$. In the case of predictive complexity of non-logarithmic
type the cardinality of the set of all sequences of bounded complexity is infinite.
We can estimate the number of sequences of length $n$ having predictive
complexity less than $k$. We denote by $\#A$ the cardinality of a finite set $A$. Let us
consider the set

$$A_{n,k} = \{ y \mid l(y) = n, \ KG(y) \le k \}. \tag{8}$$

Let $\lambda(\sigma, p)$ be a bounded loss function and $KG(x)$ be the corresponding
predictive complexity.

Proposition 5. Let $0 < \epsilon < 1$ be a rational number. Then there exists a
constant $c$ such that for all $n$ and $k$ such that $k \le \min\{ n a \delta(\epsilon), \ n a (1 - \epsilon) \}$ the
following inequalities hold:

$$\sum_{i \le (k - c)/a} \binom{n}{i} \le \#A_{n,k} \le \sum_{i \le k/b} \binom{k/(a\delta(\epsilon))}{i} \sum_{i \le k/(a(1 - \epsilon))} \binom{n}{i}. \tag{9}$$



Proof. Let a sequence $x$ of length $n$ have no more than $m$ ones. Consider the prediction
strategy $S(z) = 0$ for all $z$. Then there are at least $\sum_{i \le m} \binom{n}{i}$ sequences $x$ such that

$KG(x) \le \mathrm{Loss}_S(x) + c \le am + c \le k$, where $c$ is a constant (the first inequality
holds by Proposition 1). Then $m \le (k - c)/a$
and we obtain the left-hand side of the inequality (9).
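
The counting step behind the lower bound can be verified by brute force for small $n$. The sketch below assumes the normalized square loss $\lambda(\sigma, p) = a(\sigma - p)^2$, for which the all-zero strategy loses exactly $a$ on every 1 and nothing on every 0; the function names are illustrative only.

```python
from itertools import product
from math import comb


def loss_all_zero_strategy(x, a: float = 1.0) -> float:
    """Total normalized square loss of S(z) = 0 on x: a * (number of ones)."""
    return sum(a * (xi - 0) ** 2 for xi in x)


def count_low_loss(n: int, m: int, a: float = 1.0) -> int:
    """Number of binary sequences of length n on which S = 0 loses at most a*m."""
    return sum(
        1
        for x in product((0, 1), repeat=n)
        if loss_all_zero_strategy(x, a) <= a * m
    )


def binomial_prefix_sum(n: int, m: int) -> int:
    """sum_{i <= m} C(n, i), the left-hand side of (9) with m = (k - c)/a."""
    return sum(comb(n, i) for i in range(m + 1))


n, m = 8, 3
brute = count_low_loss(n, m)
closed = binomial_prefix_sum(n, m)
```

Both counts agree because a sequence has loss at most $am$ under this strategy exactly when it contains at most $m$ ones.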

To explain the main idea of the proof of the right-hand side of (9), we first
consider a proof of a simpler upper bound

$$\#A_{n,k} \le \sum_{i \le k/b} \binom{n}{i}, \tag{10}$$

which is valid for all $k \le bn$.

We consider a binary tree whose vertices are all finite binary sequences, and
whose edges are defined by all pairs $(x, x0)$ and $(x, x1)$, where $x$ is a finite binary sequence.
We consider the universal prediction strategy $\Lambda(x) = p$ defined above. By
restriction 1) on a loss function, for any $x$ we have $\lambda(0, \Lambda(x)) \ge b$ or $\lambda(1, \Lambda(x)) \ge b$.
Using this property we assign a new labelling to the edges of the binary tree using
the letters A and B. We assign A to $(x, x0)$ and B to $(x, x1)$ if $\lambda(0, \Lambda(x)) \ge b$, and
assign B to $(x, x0)$ and A to $(x, x1)$ otherwise. Evidently, two different sequences
of length $n$ have different labellings. For each edge $(x, x\sigma)$ labelled by A it holds that
$\lambda(\sigma, \Lambda(x)) \ge b$ and, hence, for any sequence $x$ of length $n$ having more than $m$
letters A it holds that $KG(x) \ge \mathrm{Loss}_\Lambda(x) \ge bm$. Therefore, the bound (10) holds.

To prove the upper bound (9), assign a labelling to the edges $(x, x0)$ and
$(x, x1)$ of the binary tree using the letters A, B and C, D as follows. For any $x$
consider two cases.

Case 1. There is an edge $(x, x\sigma)$ such that $\lambda(\sigma, \Lambda(x)) > a(1 - \epsilon)$. In this
case we assign C to $(x, x\sigma)$ and D to $(x, x\bar{\sigma})$, where $\bar{\sigma} = 1$ if $\sigma = 0$, and $\bar{\sigma} = 0$
otherwise.

Case 2. Case 1 does not hold, i.e. $\lambda(\sigma, \Lambda(x)) \le a(1 - \epsilon)$ for all $\sigma$. In this case
we assign the letter A to $(x, x0)$ and the letter B to $(x, x1)$ if $\lambda(0, \Lambda(x)) \ge b$, and
assign these letters vice versa otherwise.

Evidently, two different sequences of length $n$ have different labellings.

If some edge $(x, x\sigma)$ is labelled by C then $\lambda(\sigma, \Lambda(x)) > a(1 - \epsilon)$ and, hence, for
any path $x$ of length $n$ having more than $k/(a(1 - \epsilon))$ letters C it holds that
$KG(x) \ge \mathrm{Loss}_\Lambda(x) > k$.

By definition, if some edge $(x, x\sigma)$ is labelled by A or by B then $\lambda(\sigma, \Lambda(x)) \le
a(1 - \epsilon)$ for all $\sigma$. Then by Proposition 2 we have $\lambda(\sigma, \Lambda(x)) \ge a\delta(\epsilon)$ for all $\sigma$.
Hence, for any path $x$ of length $n$ having more than $k/(a\delta(\epsilon))$ letters A or
B it holds that $KG(x) \ge \mathrm{Loss}_\Lambda(x) > k$.

Hence, any sequence $x$ of length $n$ on which $KG(x) \le k$ can have no more
than $k/(a\delta(\epsilon))$ letters A or B and no more than $k/(a(1 - \epsilon))$ letters C; the rest
of $x$ are letters D. It also has no more than $k/b$ letters A.

By means of these labellings, every sequence $x \in A_{n,k}$ can be recovered from the
following pair $(\alpha, \beta)$ of sequences. The first element of this pair is the sequence
$\alpha$ of all letters A and B assigned to edges on $x$ in the original order. This
sequence contains no more than $k/b$ letters A. It also cannot be longer than
$k/(a\delta(\epsilon))$. The second element of the pair is the sequence $\beta$ of all letters C and
D assigned to edges on $x$ in the original order. This sequence contains no more
than $k/(a(1 - \epsilon))$ letters C. Given these two sequences $(\alpha, \beta)$, the whole sequence $x$
can be recovered as follows. Let $x^{i-1} = x_1 \dots x_{i-1}$, where $1 \le i \le n$, be already
recovered from some initial fragments $\alpha_1 \dots \alpha_{s-1}$ and $\beta_1 \dots \beta_{q-1}$ of the sequences $\alpha$ and $\beta$. We
can place $x^{i-1}$ in the binary tree supplied with the new labellings and so determine the letters
assigned to the edges $(x^{i-1}, x^{i-1}0)$ and $(x^{i-1}, x^{i-1}1)$. Comparing these letters with
$\alpha_s$ and $\beta_q$ we can determine which sequence must be used in recovering the next
member of $x$. The corresponding letter $\alpha_s$ or $\beta_q$ of this sequence determines the
member $x_i$ of the sequence $x$.

Note that the labelling and, hence, our method of recovering are
incomputable. It gives us only a possibility to estimate the number of elements of the
set $A_{n,k}$. The method of recovering shows that to do this, it is enough to estimate
the number of all such pairs $(\alpha, \beta)$. It can be estimated as follows:

$$\#A_{n,k} \le \sum_{i \le k/b} \binom{k/(a\delta(\epsilon))}{i} \sum_{i \le k/(a(1 - \epsilon))} \binom{n}{i}.$$

□

Note that the upper bound (9) is valid for $k$ much smaller than $n$ for small $\epsilon$.

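Proposition 6. Let $0 < \epsilon \le \delta^{-1}\left(\frac{b}{2a}\right)$.

(i) If in addition $\epsilon \le \frac{1}{2}$ then a positive constant $c$ exists such that for all $x$

$$K(x) \le \frac{KG(x)}{a(1 - \epsilon)} \left( \log l(x) - \log \frac{KG(x)}{4a(1 - \epsilon)} \right) - \tag{11}$$

$$2 \log \left( \frac{a\delta(\epsilon)}{b} \right) \frac{KG(x)}{b} + c. \tag{12}$$

(ii) For all sufficiently large $n$, for all $x$ of length $n$, if $KG(x) \le \frac{n}{2} a(1 - \epsilon)$
then

$$\frac{K(x)}{n} \le H\left( \frac{KG(x)}{na(1 - \epsilon)} \right) - 2 \log \left( \frac{a\delta(\epsilon)}{b} \right) \frac{KG(x)}{bn} + \frac{7 \log n}{n}, \tag{13}$$

where $H(p) = -p \log p - (1 - p) \log(1 - p)$ is the Shannon entropy.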



Proof. Let us consider the recursively enumerable set $A_{n,k}$ defined by (8) above.
We can specify any $x \in A_{n,k}$ by $n$, $k$ and the ordinal number of $x$ in the natural
enumeration of $A_{n,k}$, i.e. $K(x) \le \log \#A_{n,k} + 2 \log n + 2 \log k + c$, for some
constant $c$. After that we make some transformations of the upper bound (9) of
Proposition 5 and replace $k$ by $KG(x)$.

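We will use the following estimates of the binomial coefficients from [3],
Section 6.1:

$$\left( \frac{n}{k} \right)^k \le \binom{n}{k} \le \left( \frac{en}{k} \right)^k, \tag{14}$$

and the estimates

$$\sum_{i \le m} \binom{n}{i} \le (m + 1) \binom{n}{m}, \tag{15}$$

$$\log \binom{n}{s} \le n H\left( \frac{s}{n} \right) \tag{16}$$

for any $m \le \frac{n}{2}$ and $s \le n$. We also use the inequality

$$\frac{H(p)}{p} \le -2 \log p \tag{17}$$

for all $0 < p \le \frac{1}{2}$.

Let $k \le \frac{n}{2} a(1 - \epsilon)$. We have also $\frac{k}{b} \le \frac{1}{2} \frac{k}{a\delta(\epsilon)}$ for all $\epsilon \le \delta^{-1}\left(\frac{b}{2a}\right)$.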
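
The standard estimates used in this proof can be checked numerically for small parameters. The sketch below tests the two-sided bound $(n/k)^k \le \binom{n}{k} \le (en/k)^k$, the prefix-sum bound $\sum_{i \le m} \binom{n}{i} \le (m+1)\binom{n}{m}$ for $m \le n/2$, the entropy bound $\log \binom{n}{s} \le nH(s/n)$, and $H(p)/p \le -2\log p$ for $0 < p \le 1/2$. Logarithms are taken to base 2, which is an assumption, and the function names are illustrative.

```python
from math import comb, e, log2


def entropy(p: float) -> float:
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def check_binomial_bounds(n: int) -> bool:
    tol = 1e-9
    # Two-sided bound on a single binomial coefficient.
    for k in range(1, n + 1):
        c = comb(n, k)
        if (n / k) ** k > c * (1.0 + tol) or c > (e * n / k) ** k * (1.0 + tol):
            return False
    # Prefix sum of binomial coefficients, valid for m <= n/2.
    for m in range(0, n // 2 + 1):
        if sum(comb(n, i) for i in range(m + 1)) > (m + 1) * comb(n, m):
            return False
    # Entropy bound on the logarithm of a binomial coefficient.
    for s in range(0, n + 1):
        if log2(comb(n, s)) > n * entropy(s / n) + tol:
            return False
    return True


def check_entropy_ratio(grid: int = 1000) -> bool:
    """H(p)/p <= -2 log2(p) for 0 < p <= 1/2, tested on a grid."""
    for j in range(1, grid // 2 + 1):
        p = j / grid
        if entropy(p) / p > -2.0 * log2(p) + 1e-9:
            return False
    return True


binomial_ok = all(check_binomial_bounds(n) for n in range(1, 41))
ratio_ok = check_entropy_ratio()
```

The last inequality is tight at $p = 1/2$, where both sides equal 2.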






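To prove inequality (11)-(12) let us consider the recursively enumerable set $A_{n,k}$
defined by (8). We can specify any $x \in A_{n,k}$ by $n$, $k$ and the ordinal number of $x$
in the natural enumeration of $A_{n,k}$. Using an appropriate encoding of all triples
of positive integer numbers, by the upper bound (9) of Proposition 5 and using (14),
(15), (16), (17) we obtain for all $x \in A_{n,k}$

$$\begin{aligned}
K(x) &\le \log \#A_{n,k} + 2 \log n + 2 \log k + c \\
&\le \log \left( \left( \frac{k}{b} + 1 \right) \binom{k/(a\delta(\epsilon))}{k/b} \right)
 + \log \left( \left( \frac{k}{a(1 - \epsilon)} + 1 \right) \binom{n}{k/(a(1 - \epsilon))} \right)
 + 2 \log n + 2 \log k + c \\
&\le \log \frac{k}{b} + \frac{k}{a\delta(\epsilon)} H\left( \frac{a\delta(\epsilon)}{b} \right)
 + \log \frac{k}{a(1 - \epsilon)}
 + \log \left( \frac{en}{k/(a(1 - \epsilon))} \right)^{k/(a(1 - \epsilon))}
 + 2 \log n + 2 \log k + c' \\
&= \log \frac{k}{b} + \frac{k}{a\delta(\epsilon)} H\left( \frac{a\delta(\epsilon)}{b} \right)
 + \log \frac{k}{a(1 - \epsilon)}
 + \frac{k}{a(1 - \epsilon)} \left( \log n + \log e - \log \frac{k}{a(1 - \epsilon)} \right)
 + 2 \log n + 2 \log k + c',
\end{aligned} \tag{18-28}$$

where $c$ and $c'$ are positive constants.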

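Put $k = KG(x)$. Then by inequalities (18)-(28) we obtain, in the case
$KG(x) \le \frac{n}{2} a(1 - \epsilon) + 2a(1 - \epsilon) = a(1 - \epsilon)\left(\frac{n}{2} + 2\right)$, the following inequality

$$K(x) \le \left( \frac{KG(x)}{a(1 - \epsilon)} + 2 \right) \left( \log n - \log \frac{KG(x)}{4a(1 - \epsilon)} \right) - \frac{2KG(x)}{b} \log \frac{a\delta(\epsilon)}{b} + c \tag{29}$$

for some positive constant $c$. We can omit the term $+2$ in (29), since $KG(x)$ is
defined up to an additive constant.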

Consider two strategies 5*1 (z) = 0 and S 2 {z) = 1 for all z. Then for each x of 
length n it holds Losssj(a;) < |n or Losss^{x) < |n. Therefore the inequality 
KG{x) < \an + c holds for some positive constant c. If KG{x) > f a(l — e) we 
have for all n and all x of length n 
K{x) < n-b21ogn-bCi < -logn- 2 3



KG{x) 
a(l - e) 



log n — log 



KG{x) 

4a(l-e) 



C2 < 



C3, 



(30) 

(31) 



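where $c_1, c_2, c_3$ are positive constants. Inequality (11)-(12) follows from (29) and
(30)-(31) when $\epsilon \le \frac{1}{2}$. Item (i) is proved.

Let us consider item (ii). In the case $KG(x) \le \frac{n}{2} a(1 - \epsilon)$ inequality (13)
can be obtained by applying inequality (16) to the second binomial coefficient
arising from the upper bound (9), as follows:

$$K(x) \le \log \frac{k}{b} + \frac{k}{a\delta(\epsilon)} H\left( \frac{a\delta(\epsilon)}{b} \right) + \log \frac{k}{a(1 - \epsilon)} + nH\left( \frac{k}{na(1 - \epsilon)} \right) + 2 \log n + 2 \log k + c, \tag{32}$$

where $c$ is a positive constant.

Putting $k = KG(x)$ in (32) and dividing by $n$ we obtain for any $\epsilon \le \delta^{-1}\left(\frac{b}{2a}\right)$,
for all sufficiently large $n$,

$$\begin{aligned}
\frac{K(x)}{n} &\le H\left( \frac{KG(x)}{na(1 - \epsilon)} \right) + \frac{KG(x)}{na\delta(\epsilon)} H\left( \frac{a\delta(\epsilon)}{b} \right) + \frac{7 \log n}{n} \\
&\le H\left( \frac{KG(x)}{na(1 - \epsilon)} \right) - 2 \log \left( \frac{a\delta(\epsilon)}{b} \right) \frac{KG(x)}{bn} + \frac{7 \log n}{n}.
\end{aligned}$$

□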



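Proposition 7. Let $0 < \gamma < 1$, $0 < \epsilon \le \delta^{-1}\left(\frac{b}{2a}\right)$. Then a positive constant $c$
exists such that for each sufficiently large $n$ and each $k \le \frac{1}{2} na(1 - \epsilon)$ a binary
sequence $x$ of length $n$ exists such that

$$k(1 - \gamma)(1 - \epsilon) \le KG(x) \le k + c, \tag{33}$$

$$K(x) \ge \log \binom{n}{k/a} - 1 \ge nH\left( \frac{k}{an} \right) - 2 \log n, \tag{34}$$

and also

$$K(x) \ge \frac{KG(x)}{a} \left( \log n - \log \frac{KG(x)}{a} \right) - 2. \tag{35}$$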



Proof. We will find $x$ satisfying the condition of this proposition in the set $A_{n,k}$
defined by (8). We must estimate the minimal $k'$ such that $\#A_{n,k} \ge \binom{n}{(k - c)/a} >
2\#A_{n,k'}$. We will show that this inequality holds for all sufficiently large $n$ if
$k' = (k - c)(1 - \gamma)(1 - \epsilon)$, where $c$ is the constant from the lower bound (9). By the
incompressibility property of Kolmogorov complexity (see m and lower bound
(9)) an $x \in A_{n,k} - A_{n,k'}$ exists such that $K(x) \ge \log \binom{n}{(k - c)/a} - 2$. After that,
using appropriate estimates of binomial coefficients and replacing $k$ by $k - c$, we
obtain inequalities (33), (34) and (35).

We will find $x$ satisfying the condition of this proposition in the set $A_{n,k}$
defined by (8). We must find some $k'$ such that $\#A_{n,k} > 2\#A_{n,k'}$.

By the upper and lower bounds (9) of Proposition 5 it is sufficient that $k'$
satisfy

$$\sum_{i \le (k - c)/a} \binom{n}{i} > 2 \sum_{i \le k'/b} \binom{k'/(a\delta(\epsilon))}{i} \sum_{i \le k'/(a(1 - \epsilon))} \binom{n}{i}, \tag{36}$$

where $c$ is the constant from the lower bound (9).
We will find $k'$ satisfying $k' \le \frac{n}{2} a(1 - \epsilon)$. By (14) inequality (36) follows from

/ no \ eh \ / ena(l - e) Wd-«) k’ 

\k-c) - b \ k' ) a(l-e)' ^ ^ 

Inequality (37) holds for all sufficiently large $n$ if $k' = (k - c)(1 - \gamma)(1 - \epsilon)$.
Then for each sufficiently large $n$ we have $\#A_{n,k} > 2\#A_{n,k'}$ and

$$(k - c)(1 - \gamma)(1 - \epsilon) \le KG(x) \le k \tag{38}$$

for all $x \in A_{n,k} - A_{n,k'}$. We have also $k' \le \frac{n}{2} a(1 - \epsilon)$ if $k \le \frac{1}{2} na(1 - \epsilon) + c$.

By the incompressibility property of Kolmogorov complexity, an
$x \in A_{n,k} - A_{n,k'}$ exists such that

$$K(x) \ge \log \binom{n}{(k - c)/a} - 2 \ge nH\left( \frac{k - c}{an} \right) - 2 \log n. \tag{39}$$

Here we used the last inequality on page 66 of [S]. We obtain also by (14)

$$K(x) \ge \log \binom{n}{(k - c)/a} - 2 \ge \frac{k - c}{a} \log n - \frac{k - c}{a} \log \frac{k - c}{a} - 2 = \tag{40}$$

$$\frac{k - c}{a} \left( \log n - \log \frac{k - c}{a} \right) - 2. \tag{41}$$

Now replacing $k$ by $k + c$ in the proof of the proposition and putting $k = KG(x)$
we obtain from (39) and (40)-(41) inequalities (34) and (35). Inequality (33) follows
from (38). □

The next corollary of Propositions 6 and 7 gives precise relations between
normalized Kolmogorov and predictive complexities. This result is too technical
and it is reformulated in Section 4 in a more convenient form.

Corollary 8. Let $0 < \epsilon \le \delta^{-1}\left(\frac{b}{2a}\right)$. Then for all sequences $x$ of sufficiently large
length $n = l(x)$, if $KG(x) \le \frac{1}{2} na(1 - \epsilon)$ then

$$\frac{K(x)}{l(x)} \le H\left( \frac{KG(x)}{a(1 - \epsilon) l(x)} \right) - 2 \log \left( \frac{a\delta(\epsilon)}{b} \right) \frac{KG(x)}{b\,l(x)} + \frac{7 \log l(x)}{l(x)},$$

and for each sufficiently large $n$ there is some $x$ of length $n$ such that

$$\frac{K(x)}{l(x)} \ge H\left( \frac{KG(x)}{a\,l(x)} \right) - \frac{2 \log l(x)}{l(x)}.$$

Proof. This corollary follows from (13) and (34). □

Theorem 9. Let $0 < \epsilon < \min\left\{ \frac{1}{2}, \delta^{-1}\left(\frac{b}{2a}\right) \right\}$. Then there exists a constant $c$ such
that for all $n$

$$\frac{1}{a} \log n - c \le f(n) \le \frac{1}{a(1 - \epsilon)} \log n - \frac{2}{b} \log \delta(\epsilon) + c,$$

where $f(n)$ is the worst-case ratio function defined by (3).

Proof. The right-hand inequality follows directly from (11)-(12). The left-hand
inequality can be derived from (33) and (35) of Proposition 7. It is enough to let
k = rf . Taking $\epsilon \to 0$ we obtain the needed inequality. □

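Theorem 10. Let $0 < \epsilon \le \delta^{-1}\left(\frac{b}{2a}\right)$. Then for each real number $t \le \frac{1}{2} a(1 - \epsilon)$

$$H\left( \frac{t}{a} \right) \le \underline{h}(t) \le \overline{h}(t) \le H\left( \frac{t}{a(1 - \epsilon)} \right) - \frac{2t}{b} \log \frac{a\delta(\epsilon)}{b}, \tag{42}$$

where $\underline{h}(t)$ and $\overline{h}(t)$ are the relative complexities comparing functions defined by (6)
and (7).

Proof. This theorem follows directly from Corollary 8. □

Appendix: Proof of Proposition 1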

A semicomputable from above sequence $KG_i(x)$ of all measures of predictive
complexity satisfying items 1)-3) of Section 2 can be defined as follows. We will
consider the recursively enumerable (r.e.) sets as consisting of pairs $(x, r)$, where
$x$ is a finite binary sequence and $r$ is a nonnegative rational number (all such pairs
can be effectively encoded using all natural numbers). Let $W$ be a universal r.e.
set such that for each r.e. set $A$ (consisting of pairs $(x, r)$ as mentioned above)
there exists a natural number $i$ such that $A = W_i = \{(x, r) \mid (i, x, r) \in W\}$. The
existence of this set is the central result of the theory of algorithms (see Rogers).

By computability of $\lambda(\sigma, p)$ a computable sequence of simple functions
$\lambda^t(\sigma, p)$ exists such that $\lambda^{t+1}(\sigma, p) \le \lambda^t(\sigma, p)$ for all $t, \sigma, p$ and $\lambda(\sigma, p) =
\inf_t \lambda^t(\sigma, p)$.

Let $W^t$ be a finite subset of $W$ enumerated in $t$ steps. Define

$$W_i^t = \{(x, r) \mid \exists r' ((i, x, r') \in W^t, \ r \ge r')\} \cup (\Xi \times \{+\infty\}).$$

It is easy to define a computable sequence of simple functions $KG_i^t(x)$ such that
$KG_i^0(x) = \infty$ and $KG_i^{t+1}(x) \le KG_i^t(x)$ for all $x$. Besides, $KG_i^t(x)$ is a minimal
(under $\le$) simple function whose graph is a subset of $W_i^t$ and such that for each
$x$ a rational $p$ exists for which

$$KG_i^t(x\sigma) - KG_i^t(x) \ge \lambda^t(\sigma, p) \tag{43}$$

holds for each $\sigma = 0, 1$. Define $KG_i(x) = \inf_t KG_i^t(x)$ for each $i$ and $x$. It follows
from (43) and the continuity of $\lambda(\sigma, p)$ in $p$ that for any $i$ the function $KG_i(x)$ is a
measure of predictive complexity.

Let a function $KG(x)$ satisfy the conditions (i), (ii) of the definition of a
measure of predictive complexity and let $W_i = \{(x, r) \mid r > KG(x)\}$, where $r$ is a
rational number. It is easy to verify that $KG(x) = KG_i(x)$ for all $x$.

Let $r_i$ be a semicomputable from below sequence of real numbers such that
$\sum_{i=1}^{\infty} r_i \le 1$. For instance, we can take $r_i = 2^{-K(i)}$, where $K(i)$ is the Kolmogorov
prefix complexity of $i$.

We prove that $KG(x)$ defined by (2) is a measure of predictive complexity.
By definition $KG(x)$ is semicomputable from above, i.e. (ii) holds. We must verify
(i). Let $\beta = e^{-\eta}$, where $\eta$ is a learning rate. Indeed, by (2) for every $x$ and $j = 0, 1$

$$KG(xj) - KG(x) = \log_\beta \frac{\sum_{i=1}^{\infty} r_i \beta^{KG_i(xj)}}{\sum_{s=1}^{\infty} r_s \beta^{KG_s(x)}} \ge \tag{44}$$

$$\log_\beta \sum_{i=1}^{\infty} q_i \beta^{\lambda(j, \gamma_i)} \ge \lambda(j, \gamma), \tag{45}$$

where

$$q_i = \frac{r_i \beta^{KG_i(x)}}{\sum_{s=1}^{\infty} r_s \beta^{KG_s(x)}}.$$

Here for any $i$ a prediction $\gamma_i$ satisfying

$$KG_i(xj) - KG_i(x) \ge \lambda(j, \gamma_i)$$

exists since each element of the sequence $KG_i(x)$ satisfies the condition (i) of
the measure of predictive complexity. A prediction $\gamma$ satisfying (45) exists by
$\eta$-mixability. For further details see m, Section 7.6.
Abstract. The present work is dedicated to the study of modes of data-
presentation between text and informant within the framework of inductive
inference. The model is such that the learner requests sequences of
positive and negative data, and the relations between the various
formalizations in dependence on the number of switches between positive
and negative data are investigated. In particular it is shown that there is
a proper hierarchy of the notions of learning from standard text, in the
basic switching model, in the newtext switching model and in the restart
switching model. The last one of these turns out to be equivalent to the
standard notion of learning from informant.



1 Introduction 

One central question studied in inductive inference is the relation between learn- 
ing from all data, that is learning from informant, and learning from positive 
data only, that is, learning from text. Learning from text is much more restrictive 
than learning from informant; already Gold |Z] gave an easy example of a class 
which can be learned from informant but not from text: the collection consisting 
of one infinite set together with all its finite subsets. Sharma m showed that 
combining learning from informant with a restrictive convergence requirement, 
namely that the first hypothesis has already to be the correct one, implies also 
learnability from text, provided that then the usual convergence requirement is 
applied and the hypothesis may be changed finitely often before converging to 
the correct one. 

The main motivation for this work is to explore the gap between these two 
extreme forms of data-presentation. Previous authors have already proposed sev- 
eral methods to investigate this gap: Using non-recursive oracles as an additional 
method cannot completely cover the gap since even the most powerful ones do 
not permit to learn all sets jS] while the oracle K does this for learning from in- 
formant; restrictions on the texts reduce their non-regularity and permit to pass 
on further information implicitly PTi7| ; strengthening the text by permitting
additional queries [11] to retrieve information not contained in standard texts.
Ascending texts permit to reconstruct the complete negative information in the
case of infinite sets but might fail to do so in the case of finite sets; thus the
class of one infinite set and all its finite subsets still is unlearnable from
ascending text.

** Supported by the Deutsche Forschungsgemeinschaft (DFG) under the Heisenberg
grant Ste 967/1-1

Motoki m and later Baliga, Case and Jain [2] added to the positive
information of the text some but not all negative information on the language to be
learned. They considered two notions of supplying the negative data: (a) there
is a finite set of negative information $S \subseteq \overline{L}$ such that the learner always
succeeds in learning the set $L$ from input $S$ plus a text for $L$; (b) there is a finite set
$S \subseteq \overline{L}$ such that the learner always succeeds in learning the set $L$ from a text for
$L$ plus a text for a set $H$ disjoint from $L$ which contains $S$, that is, which satisfies
$S \subseteq H \subseteq \overline{L}$. As one can in case (a) learn all recursively enumerable sets by a
single learner, the notion (b) is the more interesting one.

The present work treats positive and negative data symmetrically, and several
of its notions are much less powerful than those notions from [2] just discussed.
The most convenient way to define these notions is to use the idea of a minimum
adequate teacher as, for example, described by Angluin [1]. A learner requests
positive or negative data-items from a teacher which has, depending on the
exact formalization, to fulfill certain requirements in the limit. These
formalizations and also the number of switches permitted then define the model. The
naturalness of this approach is witnessed by the fact that all classes separating
the various formalizations can be defined in easy topological terms. Thus these
separating classes are as fundamental as Gold’s text-non-learnability example
in the sense that they witness the same separations also in the case that the
learners may use non-recursive oracles.

2 Notation and Preliminaries 

Notation. Any unexplained recursion theoretic notation is from [M]. The symbol
Nat denotes the set of natural numbers, $\{0, 1, 2, 3, \ldots\}$. Symbols
$\emptyset$, $\subseteq$, $\subset$, $\supseteq$, and $\supset$ denote empty set,
subset, proper subset, superset, and proper superset, respectively. Cardinality
of a set S is denoted by card(S).

dom($\eta$) and range($\eta$) denote the domain and range of a partial function
$\eta$, respectively. Sequences are partial functions $\eta$ where the domain is
either Nat or $\{y \in \mathrm{Nat} : y < x\}$ for some x. In the first case, the
length of $\eta$ (denoted $|\eta|$) is $\infty$, in the second case its length is
x. Although sequences may take a special value # to indicate a pause (when
considered as a source of data), this pause-sign is omitted from range($\eta$)
for ease of notation. Furthermore, if $x \le |\sigma|$, then $\sigma[x]$ denotes
the restriction of $\sigma$ to the domain $\{y \in \mathrm{Nat} : y < x\}$. We let
$\eta$, $\sigma$ and $\tau$ range over finite sequences. We denote the sequence
formed by the concatenation of $\tau$ at the end of $\sigma$ by $\sigma\tau$.
Furthermore, we use $\sigma x$ to denote the concatenation of the sequence
$\sigma$ and the sequence of length 1 which contains the element x.

By $\varphi$ we denote a fixed acceptable programming system for the partial
computable functions mapping Nat to Nat miM- By $\varphi_i$ we denote the partial
computable function computed by the program with number i in the
$\varphi$-system. Such a program i is a (characteristic) index for a set L if
$\varphi_i(x) = 1$ for $x \in L$ and $\varphi_i(x) = 0$ for $x \notin L$; programs
for enumeration procedures (so-called r.e. indices) are not considered in the
present work. From now on, we call the recursive subsets of Nat just languages
and therefore only consider characteristic indices and not enumeration
procedures. The symbols L, H range over languages and $\overline{L}$ denotes the
complement Nat $-$ L of L. The symbol $\mathcal{L}$ ranges over classes of
languages.

Although learning theory often also considers learning non-recursive but still
recursively enumerable sets, we restrict ourselves to the recursive case since,
for the notions of learning considered in this paper, this case already permits
the construction of all counterexamples for interesting separations of learning
criteria, while all inclusions hold for the case of recursive sets iff they hold
for the case of recursively enumerable sets. This is due to the fact that all
proofs mainly use the information-theoretic properties and not the
recursion-theoretic properties of the concepts. Furthermore, compared to
recursively enumerable sets, recursive sets have the advantage that their
complement also possesses a recursive enumeration, see Remark

Notation from Learning Theory. The main scenario of inductive inference 
is that a learner reads more and more data on an object and outputs a sequence 
of hypotheses which eventually converge to the object to be learned. The for- 
malization of the data-presentation uses the concept of a sequence introduced 
above and such data sequences are usually denoted by T (for “text”). 

Definition 2.1. [l] A text T for a language L is an infinite sequence such that
its range is L, that is, T contains all elements of L but none of
$\overline{L}$. T[n] denotes the finite initial sequence of T with length n.

Definition 2.2. 13 A learner (or learning machine) is an algorithmic device
which computes a mapping from finite sequences into Nat.

We let M range over learning machines. M(T[n]) is interpreted as the index
for the language conjectured by the learning machine M on the initial
sequence T[n]. We say that M converges on T to i (written
$M(T)\downarrow = i$) iff $(\forall^\infty n)\,[M(T[n]) = i]$.

There are several criteria for a learning machine to be successful on a
language. Below we define learning in the limit introduced by Gold [7].

Definition 2.3. |3 (a) M TxtEx-learns a text T for a language L iff almost
all outputs M(T[n]) are the same index i for L.

(b) M TxtEx-learns a language L (written: $L \in$ TxtEx(M)) just in case M
TxtEx-learns each text for L.

(c) M TxtEx-learns a class $\mathcal{L}$ of recursive languages (written:
$\mathcal{L} \subseteq$ TxtEx(M)) just in case M TxtEx-learns each language
from $\mathcal{L}$.

(d) TxtEx is the collection of all classes $\mathcal{L}$ which have a computable
TxtEx-learner.

The following propositions on learning from text are useful in proving some of
our results.

Proposition 2.4. Based on Proposition 2.2A] Let L be any infinite set
and Pos be a finite subset of L. Then
$\{H : \mathit{Pos} \subseteq H \subseteq L \wedge \mathrm{card}(L - H) \le 1\} \notin$ TxtEx.

Proposition 2.5. 0 Let L be any infinite language. If $\mathcal{L}$ contains L
and the sets $L \cap \{0, 1, \ldots, n\}$, for infinitely many $n \in$ Nat, then
$\mathcal{L} \notin$ TxtEx.

We now generalize the concept of learning and permit the learners to explicitly
request positive or negative data from a teacher in order to define learning by
switching between the types of information received.

Definition 2.6. Learning is a game between a learner M and a teacher T.
Both alternately send information in the following way: in the k-th round, the
learner first sends a request $r_k \in \{+,-\}$. Then the teacher answers with an
information $x_k$. Afterwards the learner outputs a hypothesis $e_k$. There are
three types of interactive protocols between the learner and the teacher; every
teacher satisfying the protocol is permitted.

(a) The basic switch-protocol. The teacher has two texts $T_+$, $T_-$ of L and
$\overline{L}$, respectively. After receiving $r_k$, the teacher transmits
$T_{r_k}(k)$.

(b) The restarting switch-protocol. The teacher has two texts $T_+$, $T_-$ of L
and $\overline{L}$, respectively. After receiving $r_k$ the teacher computes the
current position $l = \mathrm{card}(\{h : 0 \le h < k \wedge r_h = r_k\})$ and
transmits $T_{r_k}(l)$.

(c) The newtext switch-protocol. The teacher always sends an
$x_k \in L \cup \{\#\}$ if $r_k = +$ and an $x_k \in \overline{L} \cup \{\#\}$ if
$r_k = -$. Furthermore, if there is a k such that $r_h = r_k$ for all $h \ge k$,
and either $k = 0$ or $r_{k-1} \ne r_k$, then the sequence
$x_k, x_{k+1}, \ldots$ is a text for either L (if $r_k = +$) or $\overline{L}$
(if $r_k = -$).

A class $\mathcal{L}$ is learnable according to the given protocol iff there is
a computable machine M such that for every $L \in \mathcal{L}$ and for every
teacher satisfying the protocol for this L, the hypotheses of the learner M
converge to an index e of L. The corresponding learning-criteria are denoted by
BasicSwEx, RestartSwEx and NewSwEx, respectively.
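
The difference between the basic and the restarting switch-protocol is only
which position of the text the teacher reads in round k. The following Python
sketch illustrates this under some assumptions that are not part of the paper:
a text is modelled as a function from positions to elements, and the names
basic_answer, restart_answer, run and the example language (the even numbers)
are hypothetical.

```python
from typing import Callable, Dict, List, Union

Item = Union[int, str]            # a natural number or the pause sign "#"
Text = Callable[[int], Item]      # position -> element of the text


def basic_answer(texts: Dict[str, Text], requests: List[str]) -> Item:
    """Basic switch-protocol: in round k the teacher transmits T_{r_k}(k)."""
    k = len(requests) - 1
    return texts[requests[k]](k)


def restart_answer(texts: Dict[str, Text], requests: List[str]) -> Item:
    """Restarting switch-protocol: the teacher transmits T_{r_k}(l), where l
    counts the earlier rounds h < k with the same request r_h = r_k."""
    k = len(requests) - 1
    r_k = requests[k]
    l = sum(1 for h in range(k) if requests[h] == r_k)
    return texts[r_k](l)


def run(protocol, texts: Dict[str, Text], requests: List[str]) -> List[Item]:
    """Replay a fixed request sequence and collect the teacher's answers."""
    return [protocol(texts, requests[: k + 1]) for k in range(len(requests))]


# Illustrative language (an assumption): L = even numbers, so T_+ lists the
# even numbers and T_- lists the odd numbers.
EVEN_TEXTS = {"+": lambda i: 2 * i, "-": lambda i: 2 * i + 1}
REQUESTS = ["+", "+", "-", "+"]

print(run(basic_answer, EVEN_TEXTS, REQUESTS))    # positions 0, 1, 2, 3
print(run(restart_answer, EVEN_TEXTS, REQUESTS))  # positions 0, 1, 0, 2
```

Under the basic protocol the element $T_-(0)$ is skipped for good once round 0
was spent on a positive request, whereas the restarting protocol resumes each
text where it was left.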

Note that M is a TxtEx-learner iff M always requests positive data ($r_k = +$
for all k). Therefore, all three notions are generalizations of TxtEx-learning.

In the following we define similar restrictions on the number of switches as
has been done for the number of mind changes. We also consider counting
switches by ordinals: the learner has a counter for an ordinal, which is
downcounted at every switch and which, due to the well-ordering of the ordinals,
can be downcounted only finitely often. In order to guarantee that the learner
is computable, we consider throughout this work only recursive ordinals. In
particular, we use a fixed notation system, Ords, and a partial ordering of
ordinal notations. A method to define such a notation is, for example, given by
Kleene mm- Let $\preceq$, $\prec$, $\succeq$ and $\succ$ on ordinal notations
below refer to the partial ordering of ordinal notations in this system. We do
not go into the details of the notation system used, but instead refer the
reader to

Definition 2.7. BasicSw*Ex denotes the variant of BasicSwEx, where also
the requests of M have to converge to some r whenever M deals with a teacher
following the basic switch-protocol, for any given $L \in \mathcal{L}$.

For an ordinal notation $\alpha$, the variant BasicSw$_\alpha$Ex denotes that
the learner is equipped with a counter for an ordinal notation and that the
number of switches is restricted by requiring that the counter is initialized as
$\alpha$ and that it is downcounted exactly whenever a switch occurs: if
$r_{k+1} = r_k$ then $\alpha_{k+1} = \alpha_k$, and if $r_{k+1} \ne r_k$ then
$\alpha_{k+1} \prec \alpha_k$, where $\alpha_k$, $\alpha_{k+1}$ are the values of
the ordinal counter at rounds k, k + 1.
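
The counter condition of Definition 2.7 can be checked mechanically on a finite
run. The Python sketch below is only an illustration under stated assumptions:
instead of a general notation system Ords it uses ordinals below $\omega^2$,
coded as pairs (a, b) for $\omega \cdot a + b$ (Python compares such tuples
lexicographically, which agrees with the ordinal ordering on these codes); the
function name respects_counter is hypothetical.

```python
from typing import List, Tuple

# Assumption: (a, b) stands for the ordinal omega*a + b; tuple comparison
# in Python is lexicographic and matches the ordering of these ordinals.
Ordinal = Tuple[int, int]


def respects_counter(alpha: Ordinal, requests: List[str],
                     counters: List[Ordinal]) -> bool:
    """Check the counter condition on rounds 0 .. len(requests) - 1:
    the counter starts at alpha, stays unchanged while the request is
    unchanged, and strictly decreases whenever the request switches."""
    if len(requests) != len(counters) or not counters or counters[0] != alpha:
        return False
    for k in range(len(requests) - 1):
        if requests[k + 1] == requests[k]:
            if counters[k + 1] != counters[k]:
                return False          # downcounting without a switch
        elif not counters[k + 1] < counters[k]:
            return False              # switch without downcounting
    return True


ALPHA = (1, 0)  # the ordinal omega
GOOD = [(1, 0), (1, 0), (0, 7), (0, 7), (0, 6)]
print(respects_counter(ALPHA, ["+", "+", "-", "-", "+"], GOOD))
```

With the counter started at $\omega$, the learner may choose at its first switch
any finite bound (here 7) on the number of remaining switches.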

Similarly one defines RestartSw*Ex, NewSw*Ex, RestartSw$_\alpha$Ex and
NewSw$_\alpha$Ex. It is furthermore possible to use other convergence-criteria
than Ex for the hypotheses.

Remark 2.8. These notions might change a bit if instead of arbitrary texts 
some restrictive variants are used. 

A fat text is a text in which every element occurs infinitely often. Therefore,
arbitrarily long initial segments of the text may be missing without losing
essential information, and one can to a certain degree compensate for the loss
of information when switching in a basic text: BasicSw*Ex = NewSw*Ex when
only fat texts are permitted. The notions NewSw*Ex and RestartSw*Ex do
not change if one feeds fat texts instead of normal texts, but the notion of
BasicSw*Ex increases its power and becomes equivalent to NewSw*Ex. The
same applies to the notions obtained by bounding the number of switches by
ordinals.

A recursive text does not give an advantage for classes of recursive languages, 
since one can construct counterexample texts always such that they are recur- 
sive. As all separations constructed below involve only classes of recursive sets, 
these separations remain valid if the texts are required to be recursive. 

Gold [7] showed that the class of all recursively enumerable sets can be
learned from primitive recursive text, that is, text which has to be generated
by a primitive recursive function. In that setting, no switches are necessary at
all and all the notions coincide with Gold’s notion of learning from text.

Remark 2.9. Let $\mathcal{L}$ contain the four subsets of {0,1}. This class is
TxtEx-learnable but it is not learnable by a BasicSwEx-learner which is required
to make at least one switch on every possible data-sequence.

To see this, assume that the learner starts with requesting positive examples.
Then $0^\infty$ is a valid text for {0} and the learner eventually has to make a
switch on it after some (say n) examples. But then the learner cannot
distinguish the sets {0} and {0,1}: in case {0}, let $T_+ = 0^\infty$ and
$T_- = 1\,2\,3 \ldots$ and in case {0,1}, let $T_+ = 0^n\,1\,0^\infty$ and
$T_- = 2\,2\,3 \ldots$; so the $T_-$ differ at the first position and the $T_+$
differ at the (n+1)-st position. Neither position is seen by the learner and so
the learner cannot find out which case holds. If the learner starts by
requesting negative data, it can be trapped similarly.
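
The trap of Remark 2.9 can be replayed on finite prefixes. The Python sketch
below builds the two pairs of texts from the remark and checks that, under the
basic switch-protocol, every request sequence that starts with n positive
requests followed by a negative one yields the same data in both cases. The
function names and the bound on the number of further rounds are assumptions
made for illustration; the sketch assumes $n \ge 1$.

```python
from itertools import product


def texts_for(case: str, n: int):
    """The texts T_+ and T_- from Remark 2.9 as functions of the position."""
    if case == "{0}":
        t_plus = lambda k: 0                        # 0 0 0 ...
        t_minus = lambda k: k + 1                   # 1 2 3 ...
    else:                                           # case "{0,1}"
        t_plus = lambda k: 1 if k == n else 0       # 0^n 1 0 0 ...
        t_minus = lambda k: 2 if k == 0 else k + 1  # 2 2 3 ...
    return {"+": t_plus, "-": t_minus}


def observed(texts, requests):
    """Basic switch-protocol: round k shows position k of the requested text."""
    return [texts[r](k) for k, r in enumerate(requests)]


def indistinguishable(n: int, extra: int = 5) -> bool:
    """Compare both cases on all request sequences +^n - followed by any
    'extra' further requests."""
    for tail in product("+-", repeat=extra):
        requests = ["+"] * n + ["-"] + list(tail)
        if observed(texts_for("{0}", n), requests) != \
           observed(texts_for("{0,1}", n), requests):
            return False
    return True


print(indistinguishable(3))
```

A learner that does not switch in round n would see position n of $T_+$ and
could tell the two cases apart; the forced switch is what hides it.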

Although BasicSwEx is more powerful than TxtEx, it still has a severe
restriction since information might be lost and it might happen that a given
learner receives, due to switches, a data sequence which satisfies the protocol
for several possible languages. This cannot occur for the criteria of
NewSwEx-learning and RestartSwEx-learning, which from this point of view are
more natural.



3 Basic Relations between the Concepts 

Within this section, we investigate the basic relations between the various criteria 
of learning by switching type of information. 

Proposition 3.1.

(a) For all ordinals $\alpha$, BasicSw$_\alpha$Ex $\subseteq$ NewSw$_\alpha$Ex
$\subseteq$ RestartSw$_\alpha$Ex.

(b) BasicSw*Ex $\subseteq$ NewSw*Ex $\subseteq$ RestartSw*Ex.

(c) BasicSwEx $\subseteq$ NewSwEx $\subseteq$ RestartSwEx.

Proof. First note that any teacher using the newtext switch-protocol also
satisfies the basic switch-protocol. Also note that any teacher using the restart
switch-protocol can be easily modified to give answers using a newtext
switch-protocol, by appropriately repeating the already given positive / negative
elements before giving any new elements presented in the restart switch-protocol.
The proposition follows. □

In the following it is shown that the hierarchy from Proposition 3.1 (c) is
strict, that is,

TxtEx $\subset$ BasicSwEx $\subset$ NewSwEx $\subset$ RestartSwEx.

Besides this main goal, the influence of restricting the number of switches to be 
finite or even to respect an ordinal bound, is investigated. 

Note that the inclusion TxtEx $\subseteq$ BasicSw$_0$Ex follows directly from the
definition. Furthermore, the class $\{L : \mathrm{card}(\overline{L}) \le 1\}$ is,
by Proposition 2.4, not TxtEx-learnable, but since it contains only cofinite sets
it can be learned via some learner requesting always negative data. Thus the
inclusion TxtEx $\subset$ BasicSw$_0$Ex is strict.

Combining finite and cofinite sets is the basic idea to separate newtext
switching from basic switching using parts (a) and (c) of Theorem 3.2 below. The
class used to show this separation is quite natural and thus also interesting in
its own right:

$\mathcal{L}_{fin,cofin} = \{L : \mathrm{card}(L) < \infty \text{ or } \mathrm{card}(\overline{L}) < \infty\}$.

Theorem 3.2 below also characterizes the optimal number of switches needed
to learn $\mathcal{L}_{fin,cofin}$ (where possible): one can do it for the
criteria NewSw*Ex and RestartSw*Ex with finitely many switches but an ordinal
bound on the number of switches is impossible.

Theorem 3.2. (a) $\mathcal{L}_{fin,cofin} \in$ NewSw*Ex.

(b) For all ordinals $\alpha$, $\mathcal{L}_{fin,cofin} \notin$ RestartSw$_\alpha$Ex.

(c) $\mathcal{L}_{fin,cofin} \notin$ BasicSwEx.

Proof. (a) The machine M works in stages. At any point of time it keeps track
of elements in L and $\overline{L}$ that it has received.

Construction.

Initially let Pos = $\emptyset$, Neg = $\emptyset$ and go to stage 0.

Stage s: If |Pos| < |Neg|

    Then request a positive example x;
         update Pos = Pos $\cup$ {x} $-$ {#};
         conjecture the finite set Pos;

    Else request a negative example x;
         update Neg = Neg $\cup$ {x} $-$ {#};
         conjecture the cofinite set Nat $-$ Neg.

    Go to stage s + 1.

It is straightforward to enforce that the learner always represents each
conjectured set with the same index. Having this property, it is easy to verify
that M NewSw*Ex-learns $\mathcal{L}_{fin,cofin}$.
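
The construction in part (a) can be simulated on finite prefixes. The following
Python sketch is an illustration under assumptions that are not part of the
paper: a language is given by a membership predicate, hypotheses are represented
as tagged finite sets instead of program indices, and the class NewtextTeacher
is one particular (hypothetical) teacher satisfying the newtext switch-protocol,
namely one that restarts an enumeration 0, 1, 2, ... after every switch and
sends # for numbers of the wrong type.

```python
from typing import Callable, FrozenSet, Tuple, Union

Item = Union[int, str]


class NewtextTeacher:
    """One teacher obeying the newtext switch-protocol (an assumption): after
    each switch it walks through 0, 1, 2, ... again and sends the numbers of
    the requested type, and the pause sign "#" for all other numbers."""

    def __init__(self, member: Callable[[int], bool]):
        self.member = member
        self.last_request = None
        self.counter = 0

    def answer(self, request: str) -> Item:
        if request != self.last_request:       # a switch: start a new text
            self.last_request, self.counter = request, 0
        x, self.counter = self.counter, self.counter + 1
        wanted = (request == "+")
        return x if self.member(x) == wanted else "#"


def run_learner(member: Callable[[int], bool],
                rounds: int = 200) -> Tuple[Tuple[str, FrozenSet[int]], int]:
    """Run the stage construction; return the last hypothesis and the number
    of switches made. ("finite", P) stands for P, ("cofinite", N) for Nat - N."""
    teacher = NewtextTeacher(member)
    pos, neg = set(), set()
    requests, hypothesis = [], None
    for _ in range(rounds):
        request = "+" if len(pos) < len(neg) else "-"
        x = teacher.answer(request)
        if request == "+":
            if x != "#":
                pos.add(x)
            hypothesis = ("finite", frozenset(pos))
        else:
            if x != "#":
                neg.add(x)
            hypothesis = ("cofinite", frozenset(neg))
        requests.append(request)
    switches = sum(1 for a, b in zip(requests, requests[1:]) if a != b)
    return hypothesis, switches


print(run_learner(lambda x: x in {3, 5}))        # a finite language
print(run_learner(lambda x: x not in {1, 4}))    # a cofinite language
```

A finite simulation cannot prove convergence, but it shows the mechanism: once
Pos (or Neg) is complete it stops growing, the other set overtakes it, and the
requests stabilize on the type for which the language (or its complement) is
finite.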

(b) Suppose by way of contradiction that M RestartSw$_\alpha$Ex-learns the class
$\mathcal{L}_{fin,cofin}$. Since every finite sequence of data can be extended to
the one of a set in $\mathcal{L}_{fin,cofin}$, M has to behave correctly on all
data sequences and does not switch without downcounting the ordinal. There is a
minimal ordinal $\beta$ which M can reach in some downcounting process. For this
$\beta$, there is a corresponding round k, a sequence of requests by M and a
sequence of answers given by a teacher such that M’s ordinal counter is $\beta$
after the k-th round; let Pos be the positive data and Neg be the negative data
provided by the teacher until reaching $\beta$. As $\beta$ is minimal, M does not
make any further downcounting but stabilizes to one type of request, say to
requesting positive data; the case of requesting only negative data is similar.
Let $L = \overline{\mathit{Neg}}$. If H satisfies
$\mathit{Pos} \subseteq H \subseteq L$ and $\mathrm{card}(L - H) \le 1$ then H is
cofinite and M is required to learn H without a further switch. So M would be a
TxtEx-learner for
$\{H : \mathit{Pos} \subseteq H \subseteq L \wedge \mathrm{card}(L - H) \le 1\}$, a
contradiction to Proposition 2.4.

(c) Suppose by way of contradiction that M BasicSwEx-learns
$\mathcal{L}_{fin,cofin}$. For reasons of symmetry one can assume that the first
request of M is + and assume that the teacher gives #. Now consider the special
case that $T_-$ is either $\#^\infty$ or $y\,\#^\infty$ for some number y. The set
to be learned is either Nat or Nat $-$ {y} and the only remaining relevant
information is the text $T_+$. Thus if one could learn $\mathcal{L}_{fin,cofin}$
under the criterion BasicSwEx, then one could also TxtEx-learn the class
$\{L : \mathrm{card}(\overline{L}) \le 1\}$, a contradiction to Proposition 2.4. □

The result in (c) can be improved to show that even classes which are very easy
for NewSwEx cannot be BasicSwEx-learned.

Corollary 3.3. NewSw$_1$Ex $\not\subseteq$ BasicSwEx.

Proof. The proof of Theorem 3.2 (c) shows even that the class
$\{L : \mathrm{card}(L) \le 1 \text{ or } \mathrm{card}(\overline{L}) \le 1\}$
is not BasicSwEx-learnable; the sets with $\mathrm{card}(L) \le 1$ are added for
the case that the request $r_0$ in the proof of (c) is $-$. It remains an easy
verification that the considered class is NewSw$_1$Ex-learnable: a machine first
asks for positive examples and outputs an index for the set consisting of the
examples seen so far, unless it discovers that there are at least two elements
in the language, at which point it switches to requesting negative examples to
find the at most one negative example. □

The following theorem shows the strength of the restarting switch-protocol by
showing that it has the same learning power as the criterion InfEx where the
learner gets the full information on the set L to be learned by reading its
characteristic function instead of a text for it |C].

Theorem 3.4. RestartSwEx = InfEx.

Proof. Clearly, RestartSwEx $\subseteq$ InfEx. In order to show that InfEx
$\subseteq$ RestartSwEx, we show how to construct an informant for the input
language using a teacher which follows the restart switch-protocol. Clearly, this
suffices to prove the theorem. The learner alternately requests positive and
negative information. This gives the learner a text for L as well as for
$\overline{L}$, which allows one to construct an informant for the input
language L. □
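
The alternation argument in the proof of Theorem 3.4 can be sketched as follows.
Under the restarting switch-protocol, the k-th request of a given type is
answered with position k of the corresponding text, so alternating requests
reveal both texts completely. The Python function below collects the labelled
data seen in a finite number of rounds; the function name, the dictionary
representation of the informant prefix and the example language (multiples
of 3) are assumptions made for illustration.

```python
from typing import Callable, Dict, Union

Item = Union[int, str]
Text = Callable[[int], Item]


def informant_prefix(t_plus: Text, t_minus: Text, rounds: int) -> Dict[int, bool]:
    """Alternate + and - requests against a teacher following the restarting
    switch-protocol and record every element with its membership label."""
    labelled: Dict[int, bool] = {}
    asked = {"+": 0, "-": 0}          # how often each type was requested so far
    for k in range(rounds):
        request = "+" if k % 2 == 0 else "-"
        position = asked[request]     # restart protocol: resume the text here
        asked[request] += 1
        x = (t_plus if request == "+" else t_minus)(position)
        if x != "#":
            labelled[x] = (request == "+")
    return labelled


# Illustrative language (an assumption): L = multiples of 3.
T_PLUS = lambda i: 3 * i                          # 0 3 6 9 ...
T_MINUS = lambda i: 3 * (i // 2) + 1 + (i % 2)    # 1 2 4 5 7 8 ...

print(informant_prefix(T_PLUS, T_MINUS, 6))
```

Since both texts are read without gaps, every number eventually receives its
label, which is exactly the information an informant provides.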

The following theorem shows that the newtext switch-protocol can simulate the
restart switch-protocol if the number of switches is required to be finite.

Theorem 3.5. For all ordinals $\alpha$, RestartSw$_\alpha$Ex = NewSw$_\alpha$Ex.
Furthermore, RestartSw*Ex = NewSw*Ex.

Proof. By Proposition 3.1 it suffices to show the directions
RestartSw$_\alpha$Ex $\subseteq$ NewSw$_\alpha$Ex and RestartSw*Ex $\subseteq$
NewSw*Ex. Note that for languages in the class being learned, if the machine
makes only finitely many switches, then any teacher following the newtext
switch-protocol also follows the restart switch-protocol. The theorem follows. □

In contrast to Theorem 3.5, the following theorem shows the advantage of the
restart switch-protocol compared to the newtext switch-protocol if the number
of switches is not required to be finite.

Theorem 3.6. RestartSwEx $\not\subseteq$ NewSwEx.

Proof. Let $\mathcal{L}$ contain all finite variants of the set Odd of odd
numbers. Clearly, $\mathcal{L} \in$ RestartSwEx = InfEx. Suppose by way of
contradiction that $\mathcal{L} \in$ NewSwEx as witnessed by M. Let Even denote
the set Nat $-$ Odd of even numbers. We then consider the following cases.

Case 1: There exists a way of answering the requests of M such that positive
requests are answered by elements from Odd, negative requests are answered by
elements from Even $-$ {0} and M makes infinitely many switches.

In this case, clearly M cannot distinguish between the cases of the input
language being Odd and the input language being Odd $\cup$ {0}.

Case 2: Not case 1. Let $x_0, x_1, \ldots, x_k$ be an initial sequence of answers
such that

- for $i \le k$, if $r_i = +$, then $x_i \in$ Odd,

- for $i \le k$, if $r_i = -$, then $x_i \in$ Even $-$ {0},

- if the teacher is consistent with Odd and Odd $\cup$ {0}, then M does not make
  a further switch, that is, the following two conditions hold:

  • if $r_{k+1} = +$ and the teacher takes its future examples
    $x_{k+1}, x_{k+2}, \ldots$ from the set Odd then $r_j = r_{k+1}$ for all
    $j > k$;

  • if $r_{k+1} = -$ and the teacher takes its future examples
    $x_{k+1}, x_{k+2}, \ldots$ from the set Even $-$ {0} then $r_j = r_{k+1}$ for
    all $j > k$.

Note that there exist such $k, x_0, x_1, \ldots, x_k$ since one could otherwise
construct an infinite sequence as in case 1 by infinitely often extending a given
sequence such that infinitely many switches occur.

Case 2a: $r_{k+1} = +$.

In this case, M has to learn the set Odd and every set Odd $-$ {2x + 1}
where $2x + 1 \notin \{x_0, x_1, \ldots, x_k\}$ from positive data. This is
impossible by Proposition 2.4.

Case 2b: $r_{k+1} = -$.

This is similar to Case 2a. M needs to learn the set Odd and every set
Odd $\cup$ {2x} with $2x \notin \{0, x_0, x_1, \ldots, x_k\}$ from negative data.
Again this is impossible by a symmetric version of Proposition 2.4. □

The previous result completes the proof that all inclusions of the hierarchy
TxtEx $\subset$ BasicSwEx $\subset$ NewSwEx $\subset$ RestartSwEx are proper.

4 Counting the Number of Switches

Theorem 4.1 and Corollary 4.2 show a hierarchy with respect to the number of
switches.

Theorem 4.1. For $\alpha \succ \beta$, NewSw$_\alpha$Ex $\not\subseteq$
RestartSw$_\beta$Ex.

Proof. Extend $\preceq$ to Ords $\cup$ {−1} by letting $-1 \preceq \beta$ for
every $\beta \in$ Ords. There is a computable function od from Nat to
Ords $\cup$ {−1} such that

- for every $\beta \preceq \alpha$ there are infinitely many $x \in$ Nat such
  that od(x) = $\beta$;

- there are infinitely many $x \in$ Nat such that od(x) = −1;

- the set $\{(x, y) : \mathrm{od}(x) \preceq \mathrm{od}(y)\}$ is recursive.

A set $F = \{x_1, x_2, \ldots, x_k\} \subseteq$ Nat is $\alpha$-admissible iff

- $0 < x_1 < x_2 < \ldots < x_k$;

- $\alpha \succ \mathrm{od}(x_1) \succ \mathrm{od}(x_2) \succ \ldots \succ \mathrm{od}(x_k) = -1$.

The empty set is also $\alpha$-admissible, but no infinite set is since the
second condition postulates a descending chain of ordinals which is always
finite. Now the class $\mathcal{L}_\alpha$ is defined by the equations

$L_F = \{x : \mathrm{card}(\{0, 1, \ldots, x\} \cap F) \text{ is odd}\}$;

$\mathcal{L}_\alpha = \{L_F : F \text{ is } \alpha\text{-admissible}\}$.

Note that the set $L_\emptyset$ is just $\emptyset$. Now it is shown that the
class witnesses the separation.
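
The set $L_F$ alternates between blocks outside and inside the language, with a
change at every element of F. A minimal Python sketch of the membership test
follows; the function names in_L_F and L_F_prefix and the example set F are
hypothetical, and the admissibility condition (which depends on the function od)
is not modelled here.

```python
from typing import Iterable, List


def in_L_F(x: int, F: Iterable[int]) -> bool:
    """x is in L_F iff the number of elements of F in {0, 1, ..., x} is odd."""
    return sum(1 for y in F if y <= x) % 2 == 1


def L_F_prefix(F: Iterable[int], bound: int) -> List[int]:
    """The elements of L_F below the given bound."""
    F = list(F)
    return [x for x in range(bound) if in_L_F(x, F)]


print(L_F_prefix({2, 5, 9}, 12))   # blocks [2, 5) and [9, ...) belong to L_F
print(L_F_prefix(set(), 12))       # the empty F gives the empty language
```

If F has k elements, then $L_F$ is finite for even k and cofinite for odd k,
since the last block starting at the largest element of F is infinite.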

Claim. $\mathcal{L}_\alpha \in$ NewSw$_\alpha$Ex.

Proof of Claim. The machine M has variables n for the number of switches
done so far, E for the finite set of examples seen after the last switch, $m_n$
for the maximal element seen so far and $\gamma_n$ for the value of the
ordinal-counter after n switches. The initialization before stage 0 is
$E = \emptyset$, $n = 0$, $m_0 = 0$ and $\gamma_0 = \alpha$. The expression
$\max_{ordinals} Y$ just denotes the maximum element of a non-empty finite set Y
of ordinals with respect to their ordering.

Construction. Stage s (what is done when the s-th example x is read).

(1) If n is even, request x to be a positive example;

    If n is odd, request x to be a negative example.

(2) If $x \notin \{\#, 0, 1, \ldots, m_n\}$ and
    $X = \{y \le x : 0 \preceq \mathrm{od}(y) \prec \gamma_n\}$ is not empty

    Then switch the data type by doing the following:

         Reset $E = \emptyset$;

         Let $\beta = \max_{ordinals} \{\mathrm{od}(y) : y \in X\}$;

         Update n = n + 1;

         Let $m_n = x$ and $\gamma_n = \beta$;

    Else let $E = E \cup \{x\} - \{\#\}$.

(3) If $E \not\subseteq \{0, 1, \ldots, m_n\}$ then let a be the first example
    outside $\{0, 1, \ldots, m_n\}$ which had shown up after the n-th switch,
    else let $a = m_n$.

(4) If n is even and $a = m_n$ then conjecture $E$;

    If n is even and $a > m_n$ then conjecture $E \cup \{a, a + 1, \ldots\}$;

    If n is odd and $a = m_n$ then conjecture $\overline{E}$;

    If n is odd and $a > m_n$ then conjecture $\{0, 1, \ldots, a\} - E$.

It is clear that the ordinal is downcounted at every switch of the data presenta- 
tion. Thus the ordinal bound on the number of mind changes is satisfied. 

Assume that F is $\alpha$-admissible, k = card(F) and
$F = \{x_1, x_2, \ldots, x_k\}$. Below let n denote the limiting value of n in the
above algorithm. At every switch, M downcounts the ordinal from $\alpha$ through
$\gamma_1, \gamma_2, \ldots$ to $\gamma_n$ and thus keeps the ordinal bound. The
values $m_0, m_1, \ldots, m_n$ satisfy the condition that
$L_F(m_h) \ne L_F(m_{h+1})$ since the values $m_h$ with odd h are positive and
the values $m_h$ with even h are negative examples; $m_0 = 0$ and thus
$m_0 \notin L_F$ by definition. Due to the definition of F it follows that
$m_1 \ge x_1, m_2 \ge x_2, \ldots, m_n \ge x_n$. By induction, one can verify that
$\gamma_h \succeq \mathrm{od}(x_h)$ for $h = 1, 2, \ldots, n$.

When making the n-th switch, the learner knows that $m_n$ has the opposite
type of information to the one observed from now on.

Thus if no information $x > m_n$ arrives after that switch, it follows that
$x_1, x_2, \ldots, x_k \le m_n$ and thus E will eventually contain all y such
that the type of information of y is opposite to the one of $m_n$. If n is even
then $L_F = E$ and the algorithm is correct. If n is odd then
$L_F = \overline{E}$ and the algorithm is correct again.

If some $x > m_n$ arrives after the last switch, then one knows that M abstains
from switching due to the fact that whenever an example $x > m_n$ arrives then
$X = \emptyset$. As infinitely many of these examples satisfy
$\mathrm{od}(x) = \gamma$, for all ordinals $\gamma \in$ Ords, it follows that
already $\mathrm{od}(m_n) = 0$ and therefore these examples cannot qualify to go
into X. As $\gamma_h \succeq \mathrm{od}(x_h)$ for $h = 1, 2, \ldots, n$ one has
that $\gamma_n = 0 \succeq \mathrm{od}(x_n)$. It follows that
$\mathrm{od}(x_n) = 0$ and $n = k - 1$. The first example $a > m_n$ to show up
satisfies $a \ge x_k$. Thus every $x \ge a$ satisfies $L_F(x) = L_F(a)$ and it
is sufficient to know which of the $x < a$ are in $L_F$ and which are not. This
is found out in the limit and thus the sets conjectured by M are correct.

It is straightforward to ensure that M always outputs the same index for
the same set and thus converges not only semantically but also syntactically
to an index of $L_F$.

Claim. If a RestartSw$_\alpha$Ex-learner M starts with requesting a negative
example first, then M cannot RestartSw$_\alpha$Ex-learn the whole class
$\mathcal{L}_\alpha$.

Proof of Claim. Let data of type n be negative data if n is even and positive
data if n is odd. So, for this claim, data of type n is what M requests
after n switches. In the following, a set F is constructed such that M does not
RestartSw$_\alpha$Ex-learn $L_F$.

Construction of F. The inductive construction starts with $F = \emptyset$,
n = card(F) and M requesting examples of type n. There is a finite sequence
$\sigma_0\sigma_1 \ldots \sigma_n$ defined inductively such that one of the
following cases applies:

Switch: M makes a switch and requests an example of type n + 1;

LS: $\sigma_0\sigma_1 \ldots \sigma_n$ is a locking-sequence for $L_F$ in the
    sense that for every sequence $\tau$ of examples of type n for $L_F$, M
    behaves after having seen the input $\sigma_0\sigma_1 \ldots \sigma_n\tau$
    as follows:

    - M does not make a switch and continues to request examples of type n;
    - M conjectures $L_F$.

Fail: There is a text T of data of type n for $L_F$ such that M, on the sequence
    $\sigma_0\sigma_1 \ldots \sigma_n T$, does not converge to a hypothesis for
    $L_F$.

Now the construction of F is continued as follows:

Switch: After having seen $\sigma_0\sigma_1 \ldots \sigma_n$, M downcounts the
    ordinal to a new value $\gamma' \prec \gamma$. Now one takes an $x_{n+1}$
    such that
    - $\mathrm{od}(x_{n+1}) = \gamma'$;
    - $x_{n+1} > y$ for all
      $y \in F \cup \mathrm{range}(\sigma_0\sigma_1 \ldots \sigma_n) \cup \{0\}$
    and adds $x_{n+1}$ to F.

LS: Similarly as in the previous case one can choose a number $x_{n+1}$ such that
    - $\mathrm{od}(x_{n+1}) = -1$;
    - $x_{n+1} > y$ for all
      $y \in F \cup \mathrm{range}(\sigma_0\sigma_1 \ldots \sigma_n) \cup \{0\}$
    and completes the construction by adding $x_{n+1}$ to F.

Fail: One leaves F untouched and finishes the construction.

Verification. Note that in the inductive process, adding a number $x_n$ to F
never makes any previous examples invalid, therefore it is legal to do these
modifications during the construction. Furthermore, in the case that it is not
possible to satisfy the case “Switch” in the construction at some stage n, one
has that after having seen the example-sequence
$\sigma_0\sigma_1 \ldots \sigma_{n-1}$ (which is the empty sequence in the case
n = 0) M requests only data of type n as long as it sees examples from $L_F$.
Therefore there is a finite sequence $\sigma_n$ of examples of type n for $L_F$
such that $\sigma_0\sigma_1 \ldots \sigma_n$ is a locking sequence for $L_F$ by
the construction of Blum and Blum [3], that is, either case “LS” or case “Fail”
holds. So it is possible to continue the inductive definition in every step.

As the sequence $\mathrm{od}(x_1), \mathrm{od}(x_2), \ldots$ is a falling sequence
of ordinals, it must be finite and therefore the construction eventually ends in
the cases “LS” or “Fail”. In the case “Fail” it is clear that the F constructed
gives an $L_F$ not learned by M.

Let $F'$ contain those elements of F which are already in F when the
construction enters the case “LS”. Note that all $y \ge x_{n+1}$ are examples of
type n for $L_{F'}$ and that $L_F$ and $L_{F'}$ do not differ on any
$z \in \{0, 1, \ldots, x_{n+1} - 1\}$ and that thus the information seen so far
is consistent with $L_F$ and $L_{F'}$. It follows that, given any text T of type
n for $L_F$, M converges on $\sigma_0\sigma_1 \ldots \sigma_n T$ to an index of
$L_{F'}$ and thus does not learn $L_F$. □

The first claim shows that $\mathcal{L}_\alpha$ is RestartSw$_\alpha$Ex-learnable
while the second claim shows that such a learner cannot start by requesting a
negative example first. Therefore, if M were a RestartSw$_\beta$Ex-learner for
$\mathcal{L}_\alpha$ and $\beta \prec \alpha$, then M would have to start with
requesting a positive example. Now one could consider a new
RestartSw$_\alpha$Ex-learner which first requests a negative example, then
switches to positive data and downcounts the ordinal from $\alpha$ to $\beta$ and
from then on copycats the behaviour of M with an empty prehistory. Since M can
RestartSw$_\beta$Ex-learn $\mathcal{L}_\alpha$, it would follow that the new
learner RestartSw$_\alpha$Ex-learns $\mathcal{L}_\alpha$ and starts with
requesting a negative example. As this contradicts the second Claim above, the
assertion that $\mathcal{L}_\alpha$ witnesses RestartSw$_\alpha$Ex
$\not\subseteq$ RestartSw$_\beta$Ex for all $\beta \prec \alpha$ is completed. □

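Corollary 4.2. Suppose $\alpha \succ \beta$. Then BasicSw$_\alpha$Ex
$\not\subseteq$ RestartSw$_\beta$Ex, in particular:

(a) BasicSw$_\alpha$Ex $\not\subseteq$ BasicSw$_\beta$Ex.

(b) NewSw$_\alpha$Ex $\not\subseteq$ NewSw$_\beta$Ex.

(c) RestartSw$_\alpha$Ex $\not\subseteq$ RestartSw$_\beta$Ex.

Proof. The main idea is to use the cylinderification $\mathcal{L}_\alpha^{cyl}$
of the class $\mathcal{L}_\alpha$ from Theorem 4.1 in order to show that

$\mathcal{L}_\alpha^{cyl} \in$ BasicSw$_\alpha$Ex $-$ RestartSw$_\beta$Ex.

Then (a), (b) and (c) follow immediately.

Let $\langle \cdot, \cdot \rangle$ code pairs of natural numbers bijectively into
natural numbers: $\langle x, y \rangle = \frac{(x+y)(x+y+1)}{2} + x$. The
cylinderification of a set L is then defined by
$L^{cyl} = \{\langle x, y \rangle : x \in L, y \in \mathrm{Nat}\}$ and
$\mathcal{L}_\alpha^{cyl} = \{L^{cyl} : L \in \mathcal{L}_\alpha\}$, where
$\mathcal{L}_\alpha$ is as defined in Theorem 4.1.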
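
The cylinderification can be illustrated with a concrete pairing function. The
Python sketch below assumes the standard Cantor pairing
$\langle x, y \rangle = (x+y)(x+y+1)/2 + x$ (an assumption: any bijective coding
of pairs would serve the same purpose); the function names pair, unpair and
in_cylinder are hypothetical.

```python
from math import isqrt
from typing import Callable, Tuple


def pair(x: int, y: int) -> int:
    """Cantor pairing: a bijection from pairs of naturals onto the naturals."""
    return (x + y) * (x + y + 1) // 2 + x


def unpair(z: int) -> Tuple[int, int]:
    """Inverse of pair: recover (x, y) from the code z."""
    w = (isqrt(8 * z + 1) - 1) // 2     # w = x + y, index of the diagonal
    x = z - w * (w + 1) // 2
    return x, w - x


def in_cylinder(z: int, member: Callable[[int], bool]) -> bool:
    """z is in the cylinderification of L iff its first component is in L."""
    x, _ = unpair(z)
    return member(x)


is_even = lambda x: x % 2 == 0          # illustrative language (an assumption)
print(unpair(pair(4, 7)))
print([z for z in range(10) if in_cylinder(z, is_even)])
```

Every $x \in L$ is represented in the cylinderification by the infinitely many
codes $\langle x, 0 \rangle, \langle x, 1 \rangle, \ldots$, so each element of L
keeps reappearing in any text for the cylinderification.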

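Note that any text for $L^{cyl}$ is essentially a fat text for L. Therefore the
fact $\mathcal{L}_\alpha \in$ NewSw$_\alpha$Ex implies that
$\mathcal{L}_\alpha^{cyl} \in$ BasicSw$_\alpha$Ex by using Remark 2.8. On the
other hand, $\mathcal{L}_\alpha^{cyl} \notin$ RestartSw$_\beta$Ex since
$\mathcal{L}_\alpha \notin$ RestartSw$_\beta$Ex and by using Remark 2.8 again. □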

5 Conclusion 

The starting point of the present work was the fact, that there is a large gap 
between the data-presentation by a text and by an informant: a text gives only 
positive data while an informant gives complete information on the set to be 
learned. So notions of data presentation between these two extreme cases were 
proposed and the relations between them were investigated. The underlying idea 
of these notions is that the learner may switch between receiving positive and 
negative data, but these switches are either restricted in the number or may 
cause the loss of information. 

For example, the BasicSwEx-learner can at every stage only follow one of
the texts $T_+$ and $T_-$ of positive and negative information on the set L to be
learned and might therefore miss important information on the other side.

The results of the present work resolve all the relationships between the
different switching criteria proposed in this paper. In particular it was
established that the inclusion

TxtEx $\subset$ BasicSwEx $\subset$ NewSwEx $\subset$ RestartSwEx

is everywhere proper. Furthermore, the notion RestartSwEx coincides with
learning from informant. In the case of restricting the number of switches to be
finite or to meet an ordinal bound, RestartSw*Ex and RestartSw$_\alpha$Ex
coincide with NewSw*Ex and NewSw$_\alpha$Ex, respectively. The hierarchy induced
by measuring the number of switches with recursive ordinals is proper.

In summary, the notion NewSwEx and its variants obtained by bounding the number
of switches turned out to be the most natural definition in the gap between
TxtEx-learning and learning from informant. The notion of BasicSwEx-learning
is between TxtEx-learning and learning from informant, but has some
strange side-effects as pointed out in Remark 2.9.

Note that these criteria differ from learning from negative open text from [2],
which was called notion (b) in the introduction: learning from open negative text
is weaker than learning from informant and thus different from RestartSwEx.
On the other hand, the class $\mathcal{L}_{fin,cofin}$ and the class
$\mathcal{L}$ from Theorem are both learnable from negative open text and so
separate this notion from the other switching criteria mentioned in this paper.

Abstract. The main question addressed in the present work is how to
find effectively a recursive function separating two sets drawn arbitrarily
from a given collection of disjoint sets. In particular, it is investigated in
which cases it is possible to satisfy the following additional constraints:
confidence where the learner converges on all data-sequences;
conservativeness where the learner abandons only definitely wrong hypotheses;
consistency where also every intermediate hypothesis is consistent with
the data seen so far; set-driven learners whose hypotheses are independent
of the order and the number of repetitions of the data-items supplied;
learners where either the last or even all hypotheses are programs
of total recursive functions.

The present work gives an overview of the relations between these 
notions and succeeds in answering many questions by finding ways to carry 
over the corresponding results from other scenarios within inductive 
inference. Nevertheless, the relations between conservativeness and 
set-driven inference needed a novel approach, which made it possible to 
show the following two major results: 

(1) There is a class for which recursive separators can be found in 
a confident and set-driven way, but no conservative learner finds a (not 
necessarily total) separator for this class. 

(2) There is a class for which recursive separators can be found in 
a confident and conservative way, but no set-driven learner finds a (not 
necessarily total) separator for this class. 



1 Introduction 

Consider the scenario in which a subject is attempting to learn its environment. 
At any given time, the subject receives a finite piece of data about its environ- 
ment, and based on this finite information, conjectures an explanation about 
the environment. The subject is said to learn its environment just in case the 
explanations conjectured by the subject become fixed over time, and this fixed 
explanation is a correct representation of the subject’s environment. Inductive 
Inference, a subfield of computational learning theory, provides a framework for 
the study of the above scenario when the subject is an algorithmic device. The 
above model of learning is based on the work initiated by Gold [11] and has been 
used in inductive inference of both functions and languages. This model is often 
referred to as explanatory learning, in short: Ex-learning. We refer the reader to 
[I2l5l9lldll7j for background material in this field. 

In recursion theory, recursive separability of disjoint languages has been ex- 
tensively explored m- A prominent fact is that there are disjoint recursively 
enumerable sets which cannot be separated by a total recursive function which 
takes 0 on the first and 1 on the second set. Indeed, the question of which 
oracles suffice to separate any two disjoint recursively enumerable sets has 
been investigated; these oracles turned out to be those which allow one to 
compute a complete extension of Peano Arithmetic [20]. 

In the present work, we consider a combination of learning and separation. 
Thus a machine receives as input data about two disjoint languages. The 
machine is then expected to come up with, in the limit, a procedure to separate 
the two input languages. A machine is able to separate languages from a class of 
disjoint languages if it is able to separate any pair of languages from the class. 
The above can be used to model situations such as the following. Consider an 
employee in an embassy which receives letters in various languages. The job of 
the employee is to pass each letter to the appropriate interpreter for 
translation; the employee may ignore junk letters not written in any relevant 
language. We may expect the employee to become an expert in the above process 
after having seen enough examples from each of the languages used in the 
embassy. This is essentially the model of separation we are considering. 

In addition to just separability we also consider various constraints on the 
machine such as reliability, consistency, conservativeness, etc., and study how 
they affect the power of a machine to separate pairs of disjoint languages. 



2 Notation and Preliminaries 

Any unexplained recursion theoretic notation is from [22]. The symbol $N$ denotes 
the set of natural numbers, $\{0, 1, 2, 3, \ldots\}$. Cardinality of a set $S$ is 
denoted by card($S$). The maximum and minimum of a set are denoted by 
max($\cdot$), min($\cdot$), respectively, where max($\emptyset$) = 0 and 
min($\emptyset$) = $\infty$. domain($\eta$) and range($\eta$) denote the domain 
and range of a partial function $\eta$, respectively. Sequences are partial 
functions $\eta$ where the domain is either $N$ or $\{y \in N : y < x\}$ for some 
$x$. In the first case, the length of $\eta$ (denoted $|\eta|$) is $\infty$; in 
the second case its length is $x$. Although sequences may take a special value 
# to indicate a pause (when considered as a source of data), this pause-sign is 
omitted from range($\eta$) for ease of notation. Furthermore, if 
$x \leq |\sigma|$, then $\sigma[x]$ denotes the restriction of $\sigma$ to the 
domain $\{y \in N : y < x\}$. We let SEQ denote the set of all finite sequences 
and let $\sigma$ and $\tau$ range over SEQ. We denote the sequence formed by the 
concatenation of $\tau$ at the end of $\sigma$ by $\sigma\tau$. Furthermore, we 
use $\sigma x$ to denote the concatenation of sequence $\sigma$ and the sequence 
of length 1 which contains the element $x$. 
We let $\langle \cdot, \cdot \rangle$ stand for an arbitrary, computable, 
bijective mapping from $N \times N$ onto $N$ [22]. We assume without loss of 
generality that $\langle \cdot, \cdot \rangle$ is monotonically increasing in 
both of its arguments. We extend $\langle \cdot, \cdot \rangle$ to $n$-tuples in 
a natural way (including $n = 1$, where $\langle x \rangle$ may be taken to be 
$x$). Due to the above isomorphism between $N^n$ and $N$, we often identify the 
tuple $(x_1, \ldots, x_n)$ with $\langle x_1, \ldots, x_n \rangle$. 
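
One concrete choice of such a mapping is the Cantor pairing function. The sketch below is only an illustration: the text fixes an arbitrary pairing, and the names `pair`, `unpair` and `tuple_code`, as well as the right-nested extension to n-tuples, are assumptions made here.

```python
from math import isqrt


def pair(x: int, y: int) -> int:
    """Cantor pairing: a computable bijection from N x N onto N,
    strictly increasing in each argument."""
    return (x + y) * (x + y + 1) // 2 + y


def unpair(z: int) -> tuple[int, int]:
    """Inverse of pair."""
    w = (isqrt(8 * z + 1) - 1) // 2  # index of the diagonal containing z
    y = z - w * (w + 1) // 2         # position of z on that diagonal
    return w - y, y


def tuple_code(*xs: int) -> int:
    """Code an n-tuple (n >= 1) by nesting pairs to the right; <x> = x.
    The nesting order is an assumption of this sketch."""
    code = xs[-1]
    for x in reversed(xs[:-1]):
        code = pair(x, code)
    return code
```

Any other computable bijection that is increasing in both arguments would serve equally well; nothing in the later definitions depends on this particular one.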

By $\varphi$ we denote a fixed acceptable programming system for the partial 
computable functions mapping $N$ to $N$. An example for an acceptable 
programming system is any enumeration of all Turing machines. Further examples 
are standard programming languages such as Basic, Pascal, Fortran, ... provided 
that the data-type of normal variables is $N$ (without upper bound on the 
values). By $\varphi_i$ we denote the partial computable function computed by 
the program with number $i$ in the $\varphi$-system. Symbol $\mathcal{R}$ 
denotes the set of all recursive functions, that is, total computable functions. 
Symbol $\mathcal{R}_{0,1}$ denotes the set of all recursive functions with range 
a subset of $\{0,1\}$. By $\Phi$ we denote an arbitrary fixed Blum complexity 
measure for the $\varphi$-system. By $W_i$ we denote domain($\varphi_i$). $W_i$ 
is, then, the recursively enumerable (r.e.) set/language ($\subseteq N$) 
accepted (or equivalently, generated) by the $\varphi$-program $i$. We also say 
that $i$ is a grammar for $W_i$. Symbol $\mathcal{E}$ will denote the set of all 
r.e. languages. Symbol $L$, with or without decorations, ranges over 
$\mathcal{E}$. By $\overline{L}$ we denote the complement of $L$, that is 
$N - L$. 

Symbol $\mathcal{L}$, with or without decorations, ranges over subsets of 
$\mathcal{E}$. By $W_{i,s}$ we denote the set $\{x < s \mid \Phi_i(x) < s\}$. 

A class $\mathcal{L} \subseteq \mathcal{E}$ is said to be recursively enumerable 
(r.e.) [22], iff $\mathcal{L} = \emptyset$ or there exists a recursive function 
$f$ such that $\mathcal{L} = \{W_{f(i)} \mid i \in N\}$. In this latter case we 
say that $W_{f(0)}, W_{f(1)}, \ldots$ is a recursive enumeration of 
$\mathcal{L}$. $\mathcal{L}$ is said to be 1-1 enumerable iff (i) $\mathcal{L}$ 
is finite or (ii) there exists a recursive function $f$ such that 
$\mathcal{L} = \{W_{f(i)} \mid i \in N\}$ and $W_{f(i)} \neq W_{f(j)}$ if 
$i \neq j$. In this latter case we say that $W_{f(0)}, W_{f(1)}, \ldots$ is a 
1-1 recursive enumeration of $\mathcal{L}$. 

$K$ denotes the diagonal halting set, that is $\{x \mid \varphi_x(x)\downarrow\}$. 
A pair of disjoint languages, $L$ and $L'$, are said to be recursively separable 
iff there exists a recursive function $f$ such that for all $x \in L$, 
$f(x) = 0$ and for all $x \in L'$, $f(x) = 1$. If 
a pair of disjoint languages is not recursively separable, then the pair is said 
to be recursively inseparable. It is well known that there are pairs of disjoint 
recursively enumerable languages which are recursively inseparable. 

Let Disjoint = $\{\mathcal{L} \mid (\forall L, L' \in \mathcal{L})[L \cap L' = \emptyset \text{ or } L = L']\}$. That is, classes in 
Disjoint consist only of disjoint languages. 

A function $f(\cdot)$ is said to be limiting recursive, if there exists a 
recursive function $g$ such that, for all $x$, 
$f(x) = \lim_{t \to \infty} g(x,t)$. A function $F$ is said to dominate a 
function $f$, iff for all but finitely many $x$, $F(x) \geq f(x)$. Computations 
using oracles can be defined in the usual way by allowing the machine access to 
an oracle. Note that there exists a $K$-recursive function $f$ which dominates 
every recursive function and which is approximable from below. That is, there 
exists a recursive sequence of recursive functions $f_s$, such that for all 
$s, x$, $f_s(x) \leq f_{s+1}(x)$, and $f(x) = \max(\{f_s(x) : s \in N\})$. We now 
present concepts from language learning theory. 
Definition 1. (a) A text $T$ for a language $L$ is a mapping from $N$ into 
($N \cup \{\#\}$) such that $L$ is the set of natural numbers in the range of $T$. 

(b) The range of a text $T$, denoted by range($T$), is the set of natural numbers 
occurring in $T$; that is, the language which $T$ is a text for. 

(c) $T[n]$ denotes the finite initial sequence of $T$ with length $n$. 

We let T, with or without decorations, range over texts. 

Definition 2. A language learning machine is an algorithmic device which 
computes a mapping from SEQ into $N \cup \{?\}$. 

Intuitively, “?” above denotes the case when the machine may not wish to make 
a conjecture. Although it is not necessary to consider learners that issue “?” for 
identification/separation in the limit, it becomes useful when the number of mind 
changes a learner can make is bounded. We let M, with or without decorations, 
range over learning machines. M(T[n]) is interpreted as the grammar (index for 
an accepting program) conjectured by the learning machine M on the initial 
sequence $T[n]$. We say that M converges on $T$ to $i$ (written 
M($T$)$\downarrow = i$) iff $(\forall^\infty n)[M(T[n]) = i]$. 

There are several criteria for a learning machine to be successful on a 
language. Below we define identification in the limit introduced by Gold [11]. 

Definition 3. [11] (a) M TxtEx-identifies a text $T$ just in case 
$(\exists i \mid W_i = \mathrm{range}(T))\,(\forall^\infty n)[M(T[n]) = i]$. 

(b) M TxtEx-identifies a recursively enumerable language $L$ (written: 
$L \in$ TxtEx(M)) just in case M TxtEx-identifies each text for $L$. 

(c) M TxtEx-identifies a class $\mathcal{L}$ of recursively enumerable languages 
(written: $\mathcal{L} \subseteq$ TxtEx(M)) just in case M TxtEx-identifies each 
language from $\mathcal{L}$. 

(d) TxtEx = $\{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[\mathcal{L} \subseteq \mathrm{TxtEx}(M)]\}$. 
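
To make convergence on a text concrete, here is a small sketch of a learner for the class of finite languages. It is an illustration only: a finite set stands in for a grammar (an assumption of this sketch, since the text works with program indices), and the names `finite_set_learner`, `conjectures` and `last_mind_change` are hypothetical.

```python
PAUSE = "#"  # the pause symbol that may occur in a text


def finite_set_learner(prefix):
    """Conjecture exactly the numbers seen so far (pauses are ignored).
    The returned frozenset stands in for a grammar of that finite set."""
    return frozenset(item for item in prefix if item != PAUSE)


def conjectures(learner, text_prefix):
    """Return the conjectures M(T[0]), M(T[1]), ..., M(T[n])."""
    return [learner(text_prefix[:n]) for n in range(len(text_prefix) + 1)]


def last_mind_change(hyps):
    """Index of the last position at which the conjecture changed."""
    changes = [n for n in range(1, len(hyps)) if hyps[n] != hyps[n - 1]]
    return changes[-1] if changes else 0


# A prefix of a text for the language {3, 5}.
text_prefix = [3, PAUSE, 5, 3, PAUSE, 5, 5]
hyps = conjectures(finite_set_learner, text_prefix)
```

On any text for a finite language this learner stops changing its conjecture once every element has appeared, which is the behaviour that clause (a) asks for; only a finite prefix can be inspected here, so the sketch observes stabilization rather than proving it.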

By the definition of convergence, only finitely many data points from a function 
$f$ have been observed by M at the (unknown) point of convergence. Hence, 
some form of learning must take place in order for M to learn $f$. For this reason, 
hereafter the terms identify, learn and infer are used interchangeably. 

3 Separability 

We now consider the notion of separating the languages. In this case, M receives 
as input texts for two disjoint languages $L$ and $L'$. M is required to converge 
on the input to a program $i$ such that $\varphi_i(x) = 0$ for $x \in L$ and 
$\varphi_i(x) = 1$ for $x \in L'$. For $x \notin L \cup L'$, it doesn't matter 
what $\varphi_i$ outputs. Thus, for this kind of learning, we require the 
learning machines to be a mapping from SEQ $\times$ SEQ to $N \cup \{?\}$. For 
ease of presentation, we assume that the two inputs to the machine are of the 
same length. That is, when we consider M($\sigma, \sigma'$), we assume that 
$|\sigma| = |\sigma'|$. This is without loss of generality since one can always 
use padding by #'s to make the lengths the same. We further assume, without 
explicitly stating it at all places, that the two inputs are disjoint: that is, 
range($\sigma$) $\cap$ range($\sigma'$) = $\emptyset$. 
We say that M on $(T, T')$ converges to $i$ (written: M$(T, T')\downarrow = i$) 
iff for all but finitely many $n$, M$(T[n], T'[n]) = i$. 

Definition 4. (a) M Resep-identifies $(T, T')$ iff M$(T, T')$ converges to an 
index $i$ such that range($T$) $\subseteq \varphi_i^{-1}(0)$ and 
range($T'$) $\subseteq \varphi_i^{-1}(1)$. 

(b) M Resep-identifies $(L, L')$ iff for any text $T$ for $L$ and any text $T'$ 
for $L'$, M Resep-identifies $(T, T')$. 

(c) M Resep-identifies $\mathcal{L}$ iff M Resep-identifies all pairs $(L, L')$ 
where $L$ and $L'$ are disjoint sets in $\mathcal{L}$. 

(d) Resep = $\{\mathcal{L} \mid \mathcal{L} \in \mathrm{Disjoint} \wedge (\exists M)[M \text{ Resep-identifies } \mathcal{L}]\}$. 
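
The success condition in clause (a) can be checked on finite samples as in the following sketch. It rests on a modelling assumption that is not part of the text: a hypothesis is a Python callable that returns 0, 1, or `None`, where `None` stands in for an undefined value (a genuine partial recursive function would simply fail to halt). The names `separates` and `h` are hypothetical.

```python
def separates(h, pos, neg):
    """True iff h is 0 on every element of pos and 1 on every element of neg.

    h models a partial function: it returns 0, 1, or None for 'undefined'.
    Values of h outside pos and neg are never inspected, matching the remark
    that they do not matter."""
    return all(h(x) == 0 for x in pos) and all(h(x) == 1 for x in neg)


def h(x):
    """A partial separator: 0 on multiples of 3, 1 on numbers that are
    1 mod 3, and undefined elsewhere."""
    if x % 3 == 0:
        return 0
    if x % 3 == 1:
        return 1
    return None


ok = separates(h, {0, 3, 6}, {1, 4})   # h is correct on both samples
bad = separates(h, {0, 2}, {1})        # h is undefined on 2
```

Because only the ranges of the two texts are inspected, a hypothesis that is undefined or arbitrary elsewhere still counts as a separator.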
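Definition 5. [5,7,13,19,21,27] 

(a) M is Popperian iff for all $\sigma, \sigma'$ such that 
range($\sigma$) $\cap$ range($\sigma'$) = $\emptyset$ and $|\sigma| = |\sigma'|$, 
the function computed by M($\sigma, \sigma'$) is total. 

(b) M is consistent iff for all $\sigma, \sigma'$ such that 
range($\sigma$) $\cap$ range($\sigma'$) = $\emptyset$ and $|\sigma| = |\sigma'|$, 
range($\sigma$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(0)$ and 
range($\sigma'$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(1)$. 

(c) M is reliable iff for all $T, T'$ such that 
range($T$) $\cap$ range($T'$) = $\emptyset$ and M$(T, T')$ converges, 
M Resep-identifies $(T, T')$. 

(d) M is conservative iff for all $\sigma, \sigma'$ and all $\tau, \tau'$ such 
that $|\sigma| = |\sigma'|$, $|\tau| = |\tau'|$, and 
range($\sigma\tau$) $\cap$ range($\sigma'\tau'$) = $\emptyset$, if 
range($\sigma\tau$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(0)$ and 
range($\sigma'\tau'$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(1)$, then 
M($\sigma, \sigma'$) = M($\sigma\tau, \sigma'\tau'$). 

(e) M is set driven iff M is total and for all $\sigma, \sigma', \tau, \tau'$ 
such that $|\sigma| = |\sigma'|$ and $|\tau| = |\tau'|$, if 
range($\sigma$) = range($\tau$) and range($\sigma'$) = range($\tau'$), then 
M($\sigma, \sigma'$) = M($\tau, \tau'$). 

(f) M is finite iff M is total and for all $\sigma, \sigma', \tau, \tau'$ such 
that $|\sigma| = |\sigma'|$ and $|\tau| = |\tau'|$, if 
M($\sigma, \sigma'$) $\neq$ ? then M($\sigma\tau, \sigma'\tau'$) = 
M($\sigma, \sigma'$). That is, a once established hypothesis is never changed. 

(g) M is confident iff for all $T, T'$ such that 
range($T$) $\cap$ range($T'$) = $\emptyset$, M$(T, T')$ converges. 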

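We say that M is consistent on $(\sigma, \sigma')$ to mean that 
range($\sigma$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(0)$ and 
range($\sigma'$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(1)$. We say that M 
is consistent on $(T, T')$ to mean that M is consistent on $(T[n], T'[n])$, for 
all $n$. We say that M is consistent on $(L, L')$ to mean that M is consistent 
on all $(T, T')$, where $T$ is a text for $L$ and $T'$ is a text for $L'$. 

Similarly, we say that M is conservative on $(T, T')$, if for all 
$\sigma, \tau, \sigma', \tau'$ such that $|\sigma| = |\sigma'|$, 
$|\tau| = |\tau'|$, $\sigma\tau \subseteq T$, $\sigma'\tau' \subseteq T'$, and 
range($\sigma\tau$) $\cap$ range($\sigma'\tau'$) = $\emptyset$, if 
range($\sigma\tau$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(0)$ and 
range($\sigma'\tau'$) $\subseteq \varphi^{-1}_{M(\sigma,\sigma')}(1)$, then 
M($\sigma, \sigma'$) = M($\sigma\tau, \sigma'\tau'$). 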

Definition 6. M conservatively Resep-identifies $(L, L')$ iff M is conservative 
and M Resep-identifies $(L, L')$. 

Conservativesep = $\{\mathcal{L} \in \mathrm{Disjoint} \mid (\exists M)[M \text{ is conservative, and } M \text{ Resep-identifies } \mathcal{L}]\}$. 

One can similarly define Popperiansep, Reliablesep, Setdrivensep, 
Consistentsep and Finitesep. 
Definition 7. M Recsep-identifies $(T, T')$ iff M Resep-identifies $(T, T')$ and 
$\varphi_{M(T,T')}$ is a total function. 

One can similarly define Recsep-identification of pairs of languages and 
disjoint classes and the class Recsep. 

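It is not more difficult to separate $k$ disjoint sets instead of 2. For example, 
given 3 sets $L, L', L''$ by their texts $T, T', T''$, one can simulate the 
Resep-identifier for each pair of 2 sets coming up with programs $e, e', e''$ to 
separate the pairs $(L, L')$, $(L, L'')$ and $(L', L'')$, respectively. Then one 
has that the program $d$ given as 

$$\varphi_d(x) = \begin{cases}
0, & \text{if } \varphi_e(x)\downarrow = 0 \wedge \varphi_{e'}(x)\downarrow = 0; \\
1, & \text{if } \varphi_e(x)\downarrow = 1 \wedge \varphi_{e''}(x)\downarrow = 0; \\
2, & \text{if } \varphi_{e'}(x)\downarrow = 1 \wedge \varphi_{e''}(x)\downarrow = 1; \\
u, & \text{if } \varphi_e(x), \varphi_{e'}(x), \varphi_{e''}(x) \text{ are defined and no previous case applies;} \\
\uparrow, & \text{otherwise;}
\end{cases}$$

where $u$ is an arbitrary number in $\{0, 1, 2\}$, it does not matter which one. 
It is easy to verify then that $L \subseteq \varphi_d^{-1}(0)$, 
$L' \subseteq \varphi_d^{-1}(1)$, $L'' \subseteq \varphi_d^{-1}(2)$ and 
$\varphi_d$ is total if the functions $\varphi_e, \varphi_{e'}, \varphi_{e''}$ 
are total. Similar arguments deal with the case of 4, 5, ... sets. 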
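
The case distinction defining the combined program can be sketched as follows. This is an illustration under a modelling assumption: each pairwise separator is a Python callable returning 0, 1, or `None` for "undefined". A real implementation would have to dovetail the three computations, because an undefined value means non-termination rather than a returned `None`. All function names are hypothetical.

```python
def combine(phi_e, phi_e1, phi_e2, u=0):
    """Build phi_d from separators for (L, L'), (L, L'') and (L', L'').

    phi_e  is 0 on L  and 1 on L'
    phi_e1 is 0 on L  and 1 on L''
    phi_e2 is 0 on L' and 1 on L''
    u is the arbitrary default value from {0, 1, 2}."""
    def phi_d(x):
        a, b, c = phi_e(x), phi_e1(x), phi_e2(x)
        if a == 0 and b == 0:
            return 0
        if a == 1 and c == 0:
            return 1
        if b == 1 and c == 1:
            return 2
        if a is not None and b is not None and c is not None:
            return u
        return None  # undefined
    return phi_d


# Example: L, L', L'' are the residue classes 0, 1, 2 modulo 3, and each
# pairwise separator is undefined on the class it does not have to handle.
def sep(zero_class, one_class):
    def phi(x):
        if x % 3 == zero_class:
            return 0
        if x % 3 == one_class:
            return 1
        return None
    return phi


phi_d = combine(sep(0, 1), sep(0, 2), sep(1, 2))
```

In this example `phi_d` returns 0, 1 and 2 on the three residue classes even though every pairwise separator is partial, which illustrates why totality of the pairwise separators is needed only for totality of the combined program.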

Furthermore, one can also show that for the considered variants Finite- 
sep-identification, Consistentsep-identification, Recsep-identification, Con- 
fidentsep-identification, Reliablesep-identification and Conservativesep- 
identification, the notion does not depend on the number of sets used in the 
definition. So one can without loss of generality restrict oneself to defining ev- 
erything with separating pairs. 

The only special case is if there is a function $\psi$ separating all sets. But 
then the learning task for pairs becomes trivial since one only has to identify 
the numbers $i$ and $j$ such that $L$ is mapped by $\psi$ to $i$ and $L'$ to $j$. 
So the existence of such a $\psi$ allows one to separate pairs easily. The 
converse does not hold: if one takes the class containing all sets $\{2x\}$, 
$\{2x + 1\}$ with $x \notin K$ and all sets $\{2x, 2x + 1\}$ with $x \in K$, then 
it is Resep-identifiable by an easy algorithm but no $\psi$ separates all the 
sets of this class. 

The notion of stabilizing and locking sequence is useful. 



Definition 8. (Based on [5,10]) $(\sigma, \sigma')$ is a stabilizing sequence for 
M on $(L, L')$, iff (i) $|\sigma| = |\sigma'|$, (ii) range($\sigma$) 
$\subseteq L$, (iii) range($\sigma'$) $\subseteq L'$, and (iv) for all 
$\tau, \tau'$ such that $|\tau| = |\tau'|$, range($\tau$) $\subseteq L$ and 
range($\tau'$) $\subseteq L'$, [M($\sigma\tau, \sigma'\tau'$) = 
M($\sigma, \sigma'$)]. 

Definition 9. (Based on [5,10]) $(\sigma, \sigma')$ is a Resep-locking sequence 
for M on $(L, L')$ iff $(\sigma, \sigma')$ is a stabilizing sequence for M on 
$(L, L')$ and $L \subseteq \varphi^{-1}_{M(\sigma,\sigma')}(0)$ and 
$L' \subseteq \varphi^{-1}_{M(\sigma,\sigma')}(1)$. 

Lemma 10. (Based on [5,10]) If M Resep-identifies $(L, L')$, then 

(a) There exists a Resep-locking sequence for M on $(L, L')$. 

(b) Every stabilizing sequence for M on $(L, L')$ is a Resep-locking sequence 
for M on $(L, L')$. 
Note that a similar lemma applies for other criteria of separation discussed in this 
paper. For ease of notation sometimes we drop “Resep” from Resep-locking 
sequence. 

4 First Results 

A central question within the theory of inductive inference is the relation between 
the various criteria of identification. With respect to the theory of separation, 
the inclusions turn out to be easily provable. The inclusions below either follow 
immediately from the definition or are straightforward. Note that TxtFin de- 
notes identification by a finite machine which never revises its first hypothesis; 
TxtFin is the notion corresponding to Finitesep. Below TxtDecEx is a no- 
tion similar to TxtEx, except that the machine M is supposed to converge to a 
decision procedure for input language, instead of grammar for input language [H] . 

Proposition 11. TxtEx $\cap$ Disjoint $\subseteq$ Resep. 

TxtDecEx $\cap$ Disjoint $\subseteq$ Recsep. 

TxtFin $\cap$ Disjoint $\subseteq$ Finitesep. 

Finitesep $\subseteq$ Conservativesep $\cap$ Confidentsep $\subseteq$ Resep. 

Popperiansep $\subseteq$ Conservativesep. 

Popperiansep $\subseteq$ Recsep $\subseteq$ Resep. 

If one can learn a class $\mathcal{L}$ as a class of sets from positive data, one 
can also separate the disjoint sets within $\mathcal{L}$ since, for any given 
$L, L' \in \mathcal{L}$, the learner takes the hypotheses for $L$ as separators 
for $L, L'$. But the next result shows that this connection does not hold for 
the converse direction. 

Theorem 12. There is a class $\mathcal{L}$ such that 

(a) $\mathcal{L} \notin$ TxtEx. 

(b) $\mathcal{L}$ is finitely (and thus confidently and conservatively) 
Resep-identifiable; 

(c) $\mathcal{L}$ is reliably Resep-identifiable. 

Proof. Let $M_0, M_1, \ldots$ denote a listing of all the learning machines. Let 
$L_i \subseteq \{\langle i, x \rangle \mid x \in N\}$ be a recursive set such 
that $M_i$ does not TxtEx-identify $L_i$. Let 
$\mathcal{L} = \{L_i \mid i \in N\}$. Then, $\mathcal{L}$ witnesses the 
theorem. □ 

Barzdins [3,9] introduced the notion of behaviourally correct learning where the 
learner outputs infinitely many guesses of which almost all describe the function 
or set to be learned correctly. Behaviourally correct identification is a proper 
generalization of TxtEx-learning. Note that in the previous theorem, the 
non-learnability could even be strengthened by constructing $\mathcal{L}$ such 
that $L_i$ is not behaviourally correctly identified by the $i$-th machine; then 
the resulting class is not behaviourally correctly identifiable. 

The next theorem establishes that the notions of reliable, consistent and Pop- 
perian separation coincide. The notions of (globally) consistent and Popperian 
learning also coincide for learning sets from positive data while this is no longer 
true for reliable learning: only finite sets are reliably learnable from positive data 
IM Proposition 5.42]. Nevertheless, the three notions also coincide in the world 
of learning $\{0, 1\}$-valued functions from complete data [28]. 
Theorem 13. Whenever some Resep-identifier M of a class $\mathcal{L}$ of 
disjoint sets satisfies one of the properties below, one can replace M by a 
better Resep-identifier M' satisfying all of them. 

(a) M is consistent. 

(b) M is reliable. 

(c) M is Popperian. 

In particular, Reliablesep = Consistentsep = Popperiansep $\subseteq$ Recsep. 



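Proof. If M is consistent or reliable, then for every $\sigma, \sigma'$ such that 
$|\sigma| = |\sigma'|$, let each of the functions $t_{\sigma,\sigma'}$ and 
$d_{\sigma,\sigma'}$ take the first case in the case-distinctions below which 
applies: 

$$t_{\sigma,\sigma'}(x) = \begin{cases}
0, & \text{if } x \in \mathrm{range}(\sigma) \cup \mathrm{range}(\sigma'); \\
s, & \text{the first } s \text{ such that either } M(\sigma \#^s, \sigma' x^s) \neq M(\sigma,\sigma') \\
   & \text{or } M(\sigma x^s, \sigma' \#^s) \neq M(\sigma,\sigma'); \\
\uparrow, & \text{otherwise;}
\end{cases}$$

$$d_{\sigma,\sigma'}(x) = \begin{cases}
0, & \text{if } t_{\sigma,\sigma'}(x)\downarrow = 0 \wedge x \in \mathrm{range}(\sigma) \\
   & \text{or } t_{\sigma,\sigma'}(x)\downarrow > 0 \wedge M(\sigma \#^{t_{\sigma,\sigma'}(x)}, \sigma' x^{t_{\sigma,\sigma'}(x)}) \neq M(\sigma,\sigma'); \\
1, & \text{if } t_{\sigma,\sigma'}(x)\downarrow = 0 \wedge x \in \mathrm{range}(\sigma') \\
   & \text{or } t_{\sigma,\sigma'}(x)\downarrow > 0 \wedge M(\sigma x^{t_{\sigma,\sigma'}(x)}, \sigma' \#^{t_{\sigma,\sigma'}(x)}) \neq M(\sigma,\sigma'); \\
\uparrow, & \text{otherwise, that is, } t_{\sigma,\sigma'}(x)\uparrow.
\end{cases}$$

If M is either consistent or reliable, then for all $\sigma, \sigma'$, the 
functions $t_{\sigma,\sigma'}$ and $d_{\sigma,\sigma'}$ are total. Furthermore, 
if $(\sigma, \sigma')$ is a locking sequence for M on $(L, L')$, then 
$L \subseteq d^{-1}_{\sigma,\sigma'}(0)$ and 
$L' \subseteq d^{-1}_{\sigma,\sigma'}(1)$. 

If M is Popperian, then let $d_{\sigma,\sigma'}$ denote 
$\varphi_{M(\sigma,\sigma')}$. 

Thus, if M is Popperian, reliable or consistent, then for $d_{\sigma,\sigma'}$ 
defined as above, (i) for all $\sigma, \sigma'$: $d_{\sigma,\sigma'}$ is total 
and (ii) if M Resep-identifies $(L, L')$, then there exist $\sigma, \sigma'$ 
such that $L \subseteq d^{-1}_{\sigma,\sigma'}(0)$ and 
$L' \subseteq d^{-1}_{\sigma,\sigma'}(1)$. 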

Let $p_0, p_1, \ldots$ be a recursive sequence of programs such that 
$\{\varphi_{p_i} \mid i \in N\} = \{d_{\sigma,\sigma'} \mid \sigma, \sigma' \in \mathrm{SEQ} \wedge |\sigma| = |\sigma'|\} \cup \{\chi \mid \chi$ 
is the characteristic function of some finite set$\}$. Note that such an 
enumeration of programs exists. 

Now define M' as follows. M'($\tau, \tau'$) = $p_i$, where $i$ is minimal such 
that range($\tau$) $\subseteq \varphi^{-1}_{p_i}(0)$ and 
range($\tau'$) $\subseteq \varphi^{-1}_{p_i}(1)$. Since the sequence of the 
$p_i$'s contains programs for the characteristic function of every finite set, 
the above M' is total. 

It is easy to verify that M' Resep-identifies $\mathcal{L}$ and M' is Popperian, 
consistent and reliable. (Thus M' also Recsep-identifies $\mathcal{L}$.) □ 
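
The learner M' is an instance of identification by enumeration: it outputs the least index of a total hypothesis that is consistent with the data. The sketch below illustrates the idea on a finite list of hypotheses. The list, the names `make_learner`, `h0`, `h1`, `h2`, and the return value `None` (playing the role of "?") are assumptions of this sketch; in the proof the enumeration is infinite and contains the characteristic function of every finite set, so a consistent index always exists.

```python
PAUSE = "#"


def make_learner(hypotheses):
    """Return a learner that outputs the least index i such that
    hypotheses[i] is 0 on range(tau) and 1 on range(tau2)."""
    def learner(tau, tau2):
        pos = {x for x in tau if x != PAUSE}
        neg = {x for x in tau2 if x != PAUSE}
        for i, h in enumerate(hypotheses):
            if all(h(x) == 0 for x in pos) and all(h(x) == 1 for x in neg):
                return i
        return None  # no consistent hypothesis in this finite list
    return learner


def h0(x):
    return x % 2                  # 0 on even numbers, 1 on odd numbers


def h1(x):
    return 0 if x < 10 else 1     # 0 below 10, 1 from 10 on


def h2(x):
    return 1 if x == 4 else 0     # characteristic function of {4}


learner = make_learner([h0, h1, h2])
```

The conjecture depends only on the two ranges, not on order, repetitions or pauses, which is the set-driven property; and every conjecture is total and consistent with the data by construction.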

Note that M' defined in the above proof is set-driven. Thus, we also have 

Corollary 14. Consistentsep $\subseteq$ Setdrivensep. 

Popperiansep $\subseteq$ Setdrivensep. 

Reliablesep $\subseteq$ Setdrivensep. 



For the following proposition we need concepts from function learning, which are 
similar to language learning, but where the inputs are graphs of functions, and 
outputs are programs for computing the function. Alternatively, these criteria 
can be considered as restriction of language learning to the cases, where the 
input is from the class of single valued total languages only; where a language 
$L$ is single valued total, iff it satisfies the following properties: 

(i) single valuedness: $(\forall x)(\forall y, z)[[\langle x, y \rangle \in L \text{ and } \langle x, z \rangle \in L] \Rightarrow y = z]$; 

(ii) totality: $(\forall x)(\exists y)[\langle x, y \rangle \in L]$. 

We refer the reader to for the details on function learning. 

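Proposition 15. Suppose $C \subseteq \mathcal{R}_{0,1}$ is given. Then one can 
define an enumeration $f_0, f_1, \ldots$ of functions in $C$, such that the class 
$\mathcal{L}_C$ given by 

$L_i = \{\langle i, x, f_i(x) \rangle \mid x \in N\}$, 

$L'_i = \{\langle i, x, 1 - f_i(x) \rangle \mid x \in N\}$, 

$\mathcal{L}_C = \{L_i \mid i \in N\} \cup \{L'_i \mid i \in N\}$ 

satisfies the following: 

(a) $C \in$ ReliableEx $\Leftrightarrow$ $\mathcal{L}_C \in$ Reliablesep. 

(b) $C \in$ Ex $\Leftrightarrow$ $\mathcal{L}_C \in$ Recsep. 

(c) $C \in$ ConfidentEx $\Leftrightarrow$ $\mathcal{L}_C \in$ Confidentsep. 

(d) $C \in$ Fin $\Leftrightarrow$ $\mathcal{L}_C \in$ Finitesep. 

(e) $C \in$ PopperianEx $\Leftrightarrow$ $\mathcal{L}_C \in$ Popperiansep. 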
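
The following sketch shows how a single {0,1}-valued function gives rise to a disjoint pair of languages, restricted to arguments below a bound so that the sets are finite and printable. It is an illustration only: the Cantor pairing, the right-nested coding of triples, and the names `pair`, `triple` and `encode` are assumptions, since the text leaves the coding arbitrary.

```python
def pair(x: int, y: int) -> int:
    """Cantor pairing, used here as one possible computable coding."""
    return (x + y) * (x + y + 1) // 2 + y


def triple(i: int, x: int, b: int) -> int:
    """Code the triple <i, x, b> by nesting pairs to the right."""
    return pair(i, pair(x, b))


def encode(i: int, f, bound: int):
    """Finite parts of L_i and L'_i: the codes <i, x, f(x)> and
    <i, x, 1 - f(x)> for all x below bound."""
    L = {triple(i, x, f(x)) for x in range(bound)}
    L_prime = {triple(i, x, 1 - f(x)) for x in range(bound)}
    return L, L_prime


def parity(x: int) -> int:
    return x % 2


L0, L0_prime = encode(0, parity, 8)
```

Separating the two sets amounts to recovering the third component of each code, that is, to computing the underlying function, which is the intuition behind transferring results between function learning and separation.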

Proof. Let $M_0, M_1, \ldots$ denote a listing of all learning machines. It is 
easy to see that for an arbitrary enumeration $f_0, f_1, \ldots$ of functions in 
$C$, $\Rightarrow$ of (a) to (d) above holds. 

For $\Leftarrow$ we show part (a) only. Other parts can be similarly shown. 
Suppose that $C \notin$ ReliableEx. Let 
$L_{i,f} = \{\langle i, x, f(x) \rangle \mid x \in N\}$ and 
$L'_{i,f} = \{\langle i, x, 1 - f(x) \rangle \mid x \in N\}$. Note that for all 
$i$ such that $M_i$ is a reliable separator, there exists an $f \in C$ such that 
$M_i$ does not Resep-identify $(L_{i,f}, L'_{i,f})$ (otherwise, one can easily 
modify $M_i$ to show that $C \in$ ReliableEx). If $M_i$ is not reliable, then let 
$f_i$ be an arbitrary function in $C$. Otherwise let $f_i$ be a function in $C$ 
such that $M_i$ does not Resep-identify $(L_{i,f_i}, L'_{i,f_i})$. Let 
$\mathcal{L}_C = \{L_{i,f_i} \mid i \in N\} \cup \{L'_{i,f_i} \mid i \in N\}$. 
It follows that $\mathcal{L}_C \notin$ Reliablesep. $\Leftarrow$ of part (a) 
follows. □ 

The proof of Proposition [13 permits now to transfer the following noninclu- 
sions from the theory of learning functions [Il4l9ll9l2ll28l2il^ to the theory of 
learning separations. 

Corollary 16. Recsep $\not\subseteq$ Reliablesep. 

Reliablesep $\not\subseteq$ Confidentsep. 

Confidentsep $\not\subseteq$ Reliablesep. 

Popperiansep $\not\subseteq$ Confidentsep. 

Confidentsep $\not\subseteq$ Finitesep. 

Osherson, Stob and Weinstein ini Exercise 4.4.2C] noted that a class which 
consists only of infinite sets has already a set-driven learner. This result directly 
transfers to the above separation-problems derived from function-classes. Thus 
one has that the classes witnessing the non-inclusions in Corollary 16 are also 
set-driven separable. 
Corollary 17. Setdrivensep $\not\subseteq$ Confidentsep. 

Setdrivensep $\not\subseteq$ Reliablesep. 

Setdrivensep $\not\subseteq$ Finitesep. 

Proposition 18. If $L$ and $L'$ are a recursively inseparable pair, then the 
class $\mathcal{L} = \{L, L'\}$ is finitely (and thus confidently) 
Resep-identifiable but not Recsep-identifiable (and thus also neither 
consistently nor reliably Resep-identifiable). 

However, the above proposition uses the fact that there doesn't exist any 
recursive separator for $(L, L')$. The following theorem shows that one can do 
the separation of TxtFin and Recsep (and thus of Finitesep and Recsep) even if 
all languages in $\mathcal{L}$ are recursive. The proof uses a modification of 
the technique from Proposition 15 combined with the fact that there is not even 
a limiting-recursive procedure to remove undefined places from programs, even if 
it does not matter which values are filled in at these places. 

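Theorem 19. Finitesep $\cap$ Setdrivensep $\cap$ TxtFin $\not\subseteq$ Recsep. 

Furthermore, some class $\mathcal{L} \subseteq \mathcal{R}$ witnesses this 
separation. 

Proof. Let $L_x = \{\langle x, y \rangle \mid \varphi_x(y)\downarrow = 0\}$, 
$L'_x = \{\langle x, y \rangle \mid \varphi_x(y)\downarrow = 1\}$ and 
$\mathcal{L} = \{L_x, L'_x \mid L_x \neq \emptyset$ and $L'_x \neq \emptyset$ and 
card$(\{y \mid \langle x, y \rangle \notin L_x \cup L'_x\}) \leq 1\}$. Note that 
all sets in $\mathcal{L}$ are recursive. 

Given two disjoint sets $H, H'$ from $\mathcal{L}$ one can give a program for the 
function $\psi$ below, being 0 on $H$ and 1 on $H'$, after just knowing one 
element $\langle x, y \rangle$ and $\langle x', y' \rangle$ of $H$ and $H'$ as 
follows: 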



It is straightforward to extend the definition of the learner such that it 
becomes a Finitesep-identifier (by omitting any further mind change) or a 
Setdrivensep-identifier (by always taking the least pairs 
$\langle x, y \rangle$ and $\langle x', y' \rangle$ available from the input). 
Similarly one can show that $\mathcal{L}$ is TxtFin-learnable. 

If there were a Recsep-identifier M for $\mathcal{L}$, then one could construct 
a procedure which, using oracle $K$, transforms a given program $p$, such that 
$\varphi_p$ is defined and 0 or 1 at all but at most one place, into a program 
for a total extension of $\varphi_p$; but such a $K$-recursive algorithm does 
not exist. Thus $\mathcal{L} \notin$ Recsep. □ 

5 Conservative Separability 

The most involved separations are linked to conservativeness; the corresponding 
results are the main results of this work. Before investigating them in detail, 
recall that the following two inclusions were already mentioned in 
Proposition 11: 

— Finitesep $\subseteq$ Conservativesep. 

— Popperiansep $\subseteq$ Conservativesep. 



ijj{{v,w))= 
They can be used to obtain the following noninclusions which have previously 
been obtained for Finitesep and Popperiansep, respectively. 

Corollary 20. Conservativesep $\not\subseteq$ Recsep. 

Conservativesep $\not\subseteq$ Confidentsep. 

Although one can show that every procedure to learn separations can be 
transformed into one to learn these separations conservatively, this 
transformation is not effective and gives a noncomputable 
Conservativesep-identifier. Later (Theorem 22) it will also be shown that this 
loss of recursiveness is unavoidable since there is a class which is 
Recsep-identifiable but not Conservativesep-identifiable. In the following, let 
NonCompConservativesep denote the class of languages that can be 
Conservativesep-identified when the constraint that the learner has to be 
computable is dropped. 

Theorem 21. Resep $\subseteq$ NonCompConservativesep. 

Proof. This proof is almost identical to the proof for the corresponding result 
which Osherson, Stob and Weinstein [Proposition 4.5.1A] stated in the context 
of learning sets from positive data. Suppose M Resep-identifies $\mathcal{L}$. 
Define F as follows. 

$$F(\sigma x, \sigma' y) = \begin{cases}
F(\sigma, \sigma'), & \text{if } \mathrm{range}(\sigma x) \subseteq \varphi^{-1}_{F(\sigma,\sigma')}(0) \text{ and } \mathrm{range}(\sigma' y) \subseteq \varphi^{-1}_{F(\sigma,\sigma')}(1); \\
\ldots
\end{cases}$$

It is easy to verify that F is conservative, and Resep-identifies any $(L, L')$ 
which is Resep-identified by M. □ 

The next two results are the main results of the paper, which establish that the 
notions of conservative and set-driven separation are incomparable; moreover 
the two classes witnessing the two non-inclusions have a Setdrivensep-identifier 
and Conservativesep-identifier, respectively, which in addition is also a 
confident Recsep-identifier. Note that this result stands in contrast to the 
situation of learning sets from text where every set-driven learnable class is 
also conservatively learnable [Theorem 7.1]. 

Theorem 22. Confidentsep $\cap$ Recsep $\cap$ Setdrivensep $\not\subseteq$ 
Conservativesep. 

Proof. Let $M_0, M_1, \ldots$ denote an enumeration of total machines such that 
for all M, there exists an $i$ such that, if M conservatively Resep-identifies 
$\mathcal{L}$, then $M_i$ conservatively Resep-identifies $\mathcal{L}$. Note 
that there exists such an enumeration of machines (see for example, [TS], for a 
similar result for TxtEx-identification). 

Following an argument of Jockusch, there exist recursive functions $g, h$ such 
that, for all $x, y$, 

(A) $W_{g(x,y)}$ and $W_{h(x,y)}$ are infinite disjoint subsets of $O_{x,y}$. 

(B) If $W_y$ is infinite, then $W_{g(x,y)}$ and $W_{h(x,y)}$ partition the set 
$O_{x,y}$. 

(C) If $W_y$ is finite, then $W_{g(x,y)}$ and $W_{h(x,y)}$ form a recursively 
inseparable pair. 

Let ConsM = $\{x \mid (\forall y)(\forall \text{ finite } L_x, L'_x \subseteq U_{x,y} \mid L_x \cap L'_x = \emptyset)[M_x \text{ is conservative on } (L_x, L'_x)]\}$. 
Note that $\overline{\mathrm{ConsM}}$ is recursively enumerable. We will later 
construct a recursive $f$ such that for all $x$ and $y$, $W_{f(x,y)}$ is a 
recursive subset of $E_{x,y}$. In addition, for all $x$, we will define $L_x$ 
and $L'_x$. We will ensure that, for all $x$, there exists a $y$ such that: 

(D) $L_x, L'_x \subseteq U_{x,y}$. 

(E) $M_x$ does not Conservativesep-identify $(L_x, L'_x)$. 

(F) If $x \in$ ConsM and $(L_x \cup L'_x) \cap W_{f(x,y)} \neq \emptyset$, then 
$L_x$ and $L'_x$ are both finite subsets of $U_{x,y}$, and 
card$(L_x \cup L'_x) < 2 + \min((L_x \cup L'_x) \cap W_{f(x,y)})$ and 
$(L_x \cup L'_x) \cap W_{f(x,y),\max(L_x \cup L'_x)} \neq \emptyset$. 

(G) If $x \in$ ConsM and $(L_x \cup L'_x) \cap W_{f(x,y)} = \emptyset$, then 
$W_y$ is infinite, $L_x = W_{g(x,y)} \cup E_{x,y} - W_{f(x,y)}$, and 
$L'_x = W_{h(x,y)}$. 

(H) If $x \notin$ ConsM, then $L_x = \mathrm{range}(\sigma) \cup \{d\}$ and 
$L'_x = \mathrm{range}(\sigma')$, where $(\sigma, \sigma')$ is the least pair 
such that range($\sigma$) and range($\sigma'$) are disjoint subsets of 
$U_{x,y}$ and $M_x$ is not conservative on $(\sigma, \sigma')$, and 
$d \in O_{x,y}$ is the least number such that $x$ is enumerated in 
$\overline{\mathrm{ConsM}}$ within $d$ steps and 
$d > \max(\mathrm{range}(\sigma) \cup \mathrm{range}(\sigma'))$. 

Now let $\mathcal{L} = \{L_x \mid x \in N\} \cup \{L'_x \mid x \in N\}$. 

By (E), $\mathcal{L} \notin$ Conservativesep. Using (A), (B), (D), (F), (G) and 
(H) above, we easily have that $\mathcal{L} \in$ Confidentsep $\cap$ 
Setdrivensep $\cap$ Recsep. 

Construction of $f$. We now construct $W_{f(x,y)}$. After the construction, we 
will define suitable $L_x$ and $L'_x$, and show that (D) to (H) are satisfied. 

Initially let $\sigma_0 = \sigma'_0$ be the empty sequence. Let $W^s_{f(x,y)}$ 
denote the set of those elements which are enumerated into $W_{f(x,y)}$ before 
stage $s$. Go to stage 0. 

Stage $s$ 

1. Dovetail steps 2 and 3, until the search in one of them succeeds. If the 
search in step 2 succeeds (before the search in step 3), then go to step 4. If 
the search in step 3 succeeds (before the search in step 2), then go to step 5. 

2. Search for $z \in E_{x,y}$ such that 
$z > \max(\mathrm{range}(\sigma_s) \cup \mathrm{range}(\sigma'_s) \cup \{s\})$ 
and $\varphi_{M_x(\sigma_s,\sigma'_s)}(z)\downarrow = 0$. 

3. Search for $\tau_s$ and $\tau'_s$ such that the following conditions are 
satisfied. 

$|\tau_s| = |\tau'_s|$. 

$\sigma_s \subseteq \tau_s$ and range($\tau_s$) $\subseteq W_{g(x,y)} \cup E_{x,y} - W^s_{f(x,y)}$. 

$\sigma'_s \subseteq \tau'_s$ and range($\tau'_s$) $\subseteq W_{h(x,y)}$. 

$M_x(\sigma_s, \sigma'_s) \neq M_x(\tau_s, \tau'_s)$. 

4. Enumerate $z$ into $W_{f(x,y)}$. 

Search for $\tau_s$ and $\tau'_s$ such that the following conditions are 
satisfied. 

$|\tau_s| = |\tau'_s|$. 

$\sigma_s \subseteq \tau_s$ and range($\tau_s$) $\subseteq W_{g(x,y)} \cup E_{x,y} - (W^s_{f(x,y)} \cup \{z\})$. 

$\sigma'_s \subseteq \tau'_s$ and range($\tau'_s$) $\subseteq W_{h(x,y)}$. 

$M_x(\sigma_s, \sigma'_s) \neq M_x(\tau_s, \tau'_s)$. 

If and when such $\tau_s$ and $\tau'_s$ are found, go to step 5. 

5. Let $\sigma_{s+1} = \tau_s$ and $\sigma'_{s+1} = \tau'_s$. 

Go to stage $s + 1$. 

End stage $s$ 

Verification of the properties (D) through (H). Note that either $W_{f(x,y)}$ 
is finite, or there exist infinitely many stages, and $s \in W_{f(x,y)}$ iff 
$s \in W^s_{f(x,y)}$. Thus $W_{f(x,y)}$ is recursive. 

For each $x \in N$, we now consider the following cases.

Case 1: $x \notin Cons_M$.

In this case, let $\sigma, \sigma'$ be the least pair such that, for some $y$, $\mathrm{range}(\sigma)$
and $\mathrm{range}(\sigma')$ are disjoint subsets of $O_{x,y}$ and $M_x$ is not conservative on
$(\sigma, \sigma')$. Let $L_x = \mathrm{range}(\sigma) \cup \{d\}$ and $L'_x = \mathrm{range}(\sigma')$, where $d \in O_{x,y}$ is
the least number such that $x$ is enumerated into $Cons_M$ in less than $d$
steps and $d$ is larger than any element of $\mathrm{range}(\sigma) \cup \mathrm{range}(\sigma')$. Note that
there is a partial-recursive function which computes explicit lists of the
elements of $L_x$ and $L'_x$ for every $x \notin Cons_M$.

Thus, (D), (E) and (H) are satisfied, and (F) and (G) do not apply.

Case 2: $x \in Cons_M$ and there exists a $y$ such that $W_y$ is infinite and $M_x$ does
not Resep-identify $(W_{g(x,y)} \cup E_{x,y} - W_{f(x,y)}, W_{h(x,y)})$.

In this case, let $L_x = W_{g(x,y)} \cup E_{x,y} - W_{f(x,y)}$, and $L'_x = W_{h(x,y)}$. Now,
$M_x$ does not Resep-identify $(L_x, L'_x)$.

Thus, (D), (E) and (G) are satisfied, and (F) and (H) do not apply.

Case 3: $x \in Cons_M$ and for all $y$ such that $W_y$ is infinite, $M_x$ Resep-identifies
$(W_{g(x,y)} \cup E_{x,y} - W_{f(x,y)}, W_{h(x,y)})$.

In the following we will select finite $L_x, L'_x$ with $L_x \cap W_{f(x,y)} \neq \emptyset$, for
some $y$, satisfying conditions (D), (E) and (F).

Now we deal with Case 3 in detail: Let $I_1 = \{y \mid (\exists s)$ [in the construction of
$W_{f(x,y)}$, step 4 of stage $s$ is started but does not end]$\}$. Note that $\{y \mid W_y$ is
infinite$\} \subseteq I_1$. Furthermore, $I_1$ is recursively enumerable relative to the oracle $K$.
Thus, for every $y \in I_1$ one can find $s$, $z$ and $\sigma_s, \sigma'_s$ (depending on $y$) using the
oracle $K$, where in the definition of $W_{f(x,y)}$, $s$ is the stage in which step 4 is
started but does not end, and $z$ is as defined in step 4 of stage $s$.

Using the oracle $K$, one can also test whether the following two conditions
hold:

(P1) $M_x(\sigma_s z d^n, \sigma'_s \#^{n+1}) = M_x(\sigma_s, \sigma'_s)$, for all $d \in W_{g(x,y)}$ and all $n$;

(P2) $M_x(\sigma_s z \#^n, \sigma'_s d^{n+1}) = M_x(\sigma_s, \sigma'_s)$, for all $d \in W_{h(x,y)}$ and all $n$.

Let ^2 = {y G /i I (PI) and (P2) are satisfied}. Note that I 2 is recur- 
sively enumerable relative to the oracle K. Note that, if Wy is infinite, and 
Ma; conservatively Resep-identifies {Wgt^x.y) U Ex^y — W j(^x,y),Wh{x,y)), then 
- ^ 9 (s.y)U{ 2 :} and - ^K^,v)- Thus, y must satisfy

(PI) and (P2). Thus, I 2 {y \ Wy is infinite}. Since I 2 is recursively enu- 
merable relative to oracle K, and {y \ Wy is infinite} is il 2 -complete, there 
must exist a y such that Wy is finite, and y & l 2 - For the following, fix 
such a y, and corresponding s, z, as and a's, where s is the stage in which 
step 4 of hF/(a;,y) starts but does not finish, and 2 is as defined in stage s. 
Let A = and B = 

Case 3.1: At least one of the sets Wgi^x,y) ~ A and Wh(x,y) — B is infinite. 

If card(Wg(a;_y) — A) = 00 , then let d G Wyt^^.y) — A be such 
that z G Wf(^x,y),d- Now, Ma; does not Resep-identify (range(crs) U 
{z, d}, range((j()), since y satisfies (PI). Thus, we define that = 
range((Js) U {z,d} and = range(cr(). Note that card(La, U L(.) < 
2-|-card(range(crs)Urange(CT()) < 2-|-z, and z G LxnWf(^x,y),me.K{L^uL'j- 

Similarly, if cai(l{Wh{x,y) — B) = 00 , then let d G Wh(x.y) — B he such 
that z G Wf(^x,y),d- Now, Ma; does not Resep-identify (range(crs) U 
{z}, range((r()U{d}), since y satisfies (P2). Thus, let Lx = range(CTs)U{z} 
and = range(CT()U{d}. Note that card(La;UL(,) < 2-|-card(range((Ts)U 
range((7()) < 2 -h z, and z G La, n IF/(a,,y),max(La,uLg- 
Thus, (D), (E) and (F) are satisfied, and (G) and (H) do not apply. 

Case 3.2: range(crs) ^ A or range(cr() % B. 

Let d G Wg(^x.y) ~ range((Js) be such that z G Wf(^x,y),d- Let Lx = 
range((Js) U {z,d}, L}, = range(cr(). 

It is easy to verify that (D), (E) and (F) are satisfied and (G) and (H) 
do not apply. 

Case 3.3: range(cTs) C A, range((j() C B, and the two sets Wg(^x,y) ~ A and 
Wh{x,y) — B are both finite. 

Since Wg(x,y) and Wh{x,y) form a recursively inseparable pair, we must 
have that A n Ox,y and B H Ox^y are not recursive. Since x G ConsM, 
the set 

C = {dGOx,y\{3n,m) [M,(a,zd", a'#"+i) ^ M,(a„ a^)] 

A [M,(a,z#™,a(d™+i) ^ Mx{as,a',)]} 

is disjoint to A and B. However, caxd{Ox^y — (A U L U G)) = 00 , due to 
non-recursiveness of AfiOx,y and Br\Ox,y. Thus, there exists a d G Ox,y — 

(A U L U G), such that z G Wf(^x,y),d- If for all n, Ma,((TsZd”, cr(#"+^') = 
M.x{as,a's), then let Lx = range(as) U {z,d}, L(, = range(cr(). 

Otherwise, for all n, Ma,(CTsZ#”, cr(d”+^) = Ma,(cTs, cr(). In this case let 
Lx = range((Ts) U {z}, L(, = range((j() U {d}. 

Thus, (D), (E) and (F) are satisfied, and (G) and (H) do not apply. 



From the above Cases 1, 2, 3.1, 3.2 and 3.3, we have that (D) to (H) are satisfied.
This proves the theorem. □

Theorem 23. Confidentsep $\cap$ Conservativesep $\cap$ Recsep $\not\subseteq$ Setdrivensep.

A proof of Theorem 23 is given in the technical report version of this paper, as
Theorem 24.

6 Conclusion 

Blum and Blum considered the model of learning extensions of partial re- 
cursive functions. The separations considered in the present work can be viewed 
as a special case of this type of learning, since one could map the class C to 
the class F of all functions 'J'l.L' with S' being 0 on L and being 1 on L' and 
being undefined everywhere else. Now £ is (conservatively) Resep-identifiable 
iff T is (conservatively) learnable in the model of Blum and Blum [^. An ap- 
plication of the construction of a class £ which is Resep-identifiable but not 
Conservativesep-identifiable is, that the corresponding F witnesses, that in 
the model of Blum and Blum some class of partial-recursive functions is learn- 
able in the limit but is not conservatively learnable. This gives a contrast to 
the case of learning total recursive functions where Stephan and Zeugmann m 
showed that conservativeness is not restrictive. 
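
The mapping from a separation problem to a partial function described above can be illustrated with a small sketch. It is only an illustration under simplifying assumptions: finite Python sets stand in for the r.e. languages $L$ and $L'$, `None` stands for "undefined", and the name `separator` is hypothetical rather than taken from the paper.

```python
from typing import Callable, Optional, Set


def separator(L: Set[int], L_prime: Set[int]) -> Callable[[int], Optional[int]]:
    """Build the partial function that is 0 on L, 1 on L_prime, undefined elsewhere.

    Finite sets are used here as stand-ins for r.e. languages (an assumption
    of this sketch); None models an undefined value.
    """
    if L & L_prime:
        raise ValueError("the two languages must be disjoint")

    def psi(x: int) -> Optional[int]:
        if x in L:
            return 0
        if x in L_prime:
            return 1
        return None  # undefined outside L and L_prime

    return psi


psi = separator({1, 2}, {5})
print(psi(1), psi(5), psi(7))
```

Any total extension of such a function separates $L$ from $L'$, which is why learning a separator can be read as learning an extension of a partial function.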

Although every separation problem is the special case of a learning prob- 
lem in the model of Blum and Blum [S], there is no general correspondence 
between these worlds. For example, there are reliably but not consistently learn- 
able classes of functions while these notions coincide in the case of separating 
languages. 
Abstract. In inductive inference, a machine is given words in a language
and the machine is said to identify the language if it correctly names the
language. In this paper we study classes of languages where the unions
of up to a fixed number (n say) of languages from the class are identifiable.
We distinguish between two different scenarios: in one scenario, the
learner need only name the language which results from the union;
in the other, the learner must individually name the languages which
make up the union (we say that the unioned language is discerningly
identified). We define three kinds of identification criteria based on this
and, by the use of some naturally occurring classes of languages, demonstrate
that the inferring power of each of these identification criteria
decreases as we increase the number of languages allowed in the union,
thus resulting in an infinite hierarchy for each identification criterion.
A comparison between the different identification criteria also yielded
similar hierarchies. We show that for each n, there exists a class of disjoint
languages where all unions of up to n languages from this class can
be discerningly identified, but there is no learner which identifies every
union of n + 1 languages from this class. We give sufficient conditions for
classes of languages where the unions can be discerningly identified. We
also present language classes which are complete with respect to weak
reduction (in terms of intrinsic complexity) for our identification criteria.



1 Introduction 

We continue a line of enquiry explored in [Wri89, SA00, GK99], where the learner
is required to learn unions of languages drawn from a class of languages.

What is different from previous studies is that we distinguish between two 
different scenarios. In one scenario, the learner is only required to name the 

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 235- 125^ 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
language which results from the union; in the other, we want the learner to
individually name the languages which make up the union; in a sense, the learner
is discerning between the languages in the union. Our study is motivated by the
abundance of situations where learners are presented with information that is
some sort of mixture. For example, children in a multi-lingual environment are
frequently exposed to more than one (natural) language at the same time, but
are nonetheless able to tell which languages they hear; or, in a physical
experiment, radiation collected by the same detector may originate from many
different source processes, which scientists are often given the task of
discerning. We are also interested in devising mechanisms which will allow us to
distinguish between languages that have to be presented as a mixture.

In the course of identifying the languages which make up a union, what
happens when there are two (or more) possible sets of languages from the class
whose unions give the same language? Should the learner be required to name both
possibilities, or should the learner be allowed to choose any one? Or perhaps
such a situation should simply be declared unlearnable? We formalize different
identification criteria based on these considerations.

It can be said in general that the inferring power of learners lessens when
more languages are allowed in the union, and moreover, there are naturally
occurring classes of languages which witness these hierarchies. We also found
hierarchies between the different identification criteria. More notably, for
each n, there exists a class of disjoint languages where all the unions of up to n
languages from this class can be discerningly identified, but there is no learner
that can identify every union of n + 1 languages from this class.

In our attempt to characterize these identification criteria, we discovered two
sufficient conditions for classes of languages where the unions can be discerningly
identified. We demonstrate that one of these conditions is difficult to relax
further, by showing how some weaker conditions are insufficient to obtain the
same results. Finally, we give natural classes of languages which are complete
with respect to weak reduction in terms of so-called intrinsic complexity [FKS95]
for the identification criteria we defined.

Due to space constraints, we omit proofs for some of the results. Some of our 
results can also be generalized for other identification criteria. 

2 Notation and Preliminaries 

Any unexplained recursion-theoretic notation is from [Rog67]. $N$ denotes the set
of natural numbers. $N^+$ denotes the set of positive integers. Let rat denote the
set of non-negative rational numbers. $\emptyset$, $\in$, $\subset$, $\subseteq$, $\supset$, $\supseteq$ respectively denote
empty set, element of, proper subset, subset, proper superset, superset. max(·),
min(·) denote maximum and minimum of a set, where by convention $\max(\emptyset) = 0$
and $\min(\emptyset) = \infty$. Cardinality of a set $S$ is denoted by card($S$). $D_0, D_1, \ldots$ stand
for a computable sequence of all finite sets [Rog67]. $A - B$ denotes the set
$\{x \mid x \in A \text{ and } x \notin B\}$.

$\langle \cdot, \cdot \rangle$ stands for an arbitrary, computable bijective mapping from $N \times N$ onto
$N$. For all $x$ and $y$, $\pi_1(\langle x, y \rangle) = x$ and $\pi_2(\langle x, y \rangle) = y$. We assume without loss
of generality that $\langle \cdot, \cdot \rangle$ is monotonically increasing in both of its arguments. $\langle \cdot, \cdot \rangle$
can be extended to $n$-tuples in a natural way (including $n = 1$, where $\langle x \rangle$ may be
taken to be $x$). Projection functions $\pi_1, \ldots, \pi_n$ corresponding to $n$-tuples can be
defined similarly (where the tuple size would be clear from context). Due to the
above isomorphism between $N^n$ and $N$, we often identify the tuple $(x_1, \ldots, x_n)$
with $\langle x_1, \ldots, x_n \rangle$. The quantifiers $\forall^\infty$, $\exists^\infty$ and $\exists!$ denote "for all but finitely many",
"there exist infinitely many" and "there exists a unique", respectively.
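
As a concrete instance of such a pairing function, the following sketch uses the Cantor pairing function. This particular choice is an assumption made for illustration (the text only requires an arbitrary computable bijection that is increasing in both arguments), and the names `pair`, `unpair`, `pi1`, `pi2` and `tuple_code` are hypothetical.

```python
from math import isqrt


def pair(x: int, y: int) -> int:
    """Cantor pairing: a computable bijection from N x N onto N."""
    return (x + y) * (x + y + 1) // 2 + y


def unpair(z: int) -> tuple:
    """Inverse of pair: recover (x, y) from the code z."""
    w = (isqrt(8 * z + 1) - 1) // 2   # index of the diagonal containing z
    y = z - w * (w + 1) // 2          # position of z on that diagonal
    return w - y, y


def pi1(z: int) -> int:
    return unpair(z)[0]


def pi2(z: int) -> int:
    return unpair(z)[1]


def tuple_code(*xs: int) -> int:
    """Extend pairing to n-tuples by nesting; a 1-tuple <x> is coded as x."""
    code = xs[0]
    for x in xs[1:]:
        code = pair(code, x)
    return code


print(pair(2, 5), unpair(pair(2, 5)), tuple_code(7))
```

The Cantor pairing is strictly increasing in each argument, so it satisfies the monotonicity assumption made above.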

A computable numbering is a partial computable function from $N^2$ to $N$.
The symbol $\psi$ ranges over computable numberings. We denote by $\psi_i$ the partial
function $\lambda x.\psi(i, x)$. Thus $\psi_i$ denotes the partial function computed by the
program with index $i$ in the numbering $\psi$. $\Psi$ denotes an arbitrary Blum complexity
measure for $\psi$. $W_i^\psi$ denotes domain($\psi_i$). $W_i^\psi$ is, then, the r.e. set/language
($\subseteq N$) accepted (or equivalently, generated) by the $\psi$-program $i$. We also say
that $i$ is a $\psi$-grammar for $W_i^\psi$. $W_{i,s}^\psi$ denotes the set $\{x < s \mid \Psi_i(x) < s\}$. We
say that numbering $\psi$ is reducible to numbering $\psi'$ (written $\psi \le \psi'$) if and only
if there exists a recursive function $h$ such that $(\forall i)[\psi_i = \psi'_{h(i)}]$. In this case we
say that $h$ witnesses that $\psi \le \psi'$. An acceptable numbering is a computable
numbering to which every computable numbering can be reduced. The symbol
$\varphi$ denotes a standard acceptable numbering [Rog67] and the symbol $\Phi$ denotes
an arbitrary fixed Blum complexity measure for the $\varphi$-system [Blu67]. In this
paper we abbreviate $W_i^\varphi$ to $W_i$, and $W_{i,s}^\varphi$ to $W_{i,s}$.

$\mathcal{E}$ denotes the class of all r.e. languages. $\mathcal{R}$ denotes the set of all recursive
functions, that is, total computable functions. Symbol $L$, with or without decorations,
ranges over $\mathcal{E}$. The symbol $\mathcal{L}$, with or without decorations, ranges over subsets
of $\mathcal{E}$. $K$ denotes the diagonal halting problem set, that is, $K = \{x \mid x \in W_x\}$.
($K$ is a recursively enumerable, non-recursive set.) SINGLE denotes the set
$\{\{x\} \mid x \in N\}$. FIN denotes the set $\{D \subseteq N \mid D \text{ is finite}\}$. INIT denotes the
set $\{\{x \in N \mid x < n\} \mid n \in N\}$.

A class $\mathcal{L}$ of r.e. languages is said to be recursively enumerable [Rog67] if there
is $S \in \mathcal{E}$ such that $\mathcal{L} = \{W_i \mid i \in S\}$. For each infinite, recursively enumerable
class of languages $\mathcal{L}$, there exists a total recursive function $f$ such that $\mathcal{L} =
\{W_{f(i)} \mid i \in N\}$. $\mathcal{L}$ is said to be 1-1 recursively enumerable if and only if (i) $\mathcal{L}$
is finite or (ii) there exists a recursive function $f$ such that $\mathcal{L} = \{W_{f(i)} \mid i \in N\}$
and $W_{f(i)} \neq W_{f(j)}$ for $i \neq j$. In this latter case we say that $W_{f(0)}, W_{f(1)}, \ldots$ is
a 1-1 recursive enumeration of $\mathcal{L}$.

A partial function $d$ from $N$ to $N$ is said to be partial limiting recursive if
and only if there exists a recursive function $F$ from $N \times N$ to $N$ such that for all
$x$, $d(x) = \lim_{y \to \infty} F(x, y)$. Here, if $d(x)$ is not defined then $\lim_{y \to \infty} F(x, y)$ must
also be undefined. A partial limiting recursive function $d$ is called (total) limiting
recursive if $d$ is total. $\downarrow$ denotes defined or converges. $\uparrow$ denotes undefined or
diverges.

We now present concepts from language learning theory. The next definition 
introduces the concept of a sequence of data. 

Definition 1. [Gol67] (a) A sequence $\sigma$ is a mapping from an initial segment
of $N$ into $(N \cup \{\#\})$. The empty sequence is denoted by $\Lambda$.

(b) The content of a sequence $\sigma$, denoted content($\sigma$), is the set of natural numbers
in the range of $\sigma$.

(c) The length of $\sigma$, denoted by $|\sigma|$, is the number of elements in $\sigma$. So, $|\Lambda| = 0$.

(d) For $n \le |\sigma|$, the initial sequence of $\sigma$ of length $n$ is denoted by $\sigma[n]$. So,
$\sigma[0] = \Lambda$.

Intuitively, #'s represent pauses in the presentation of data. We let $\sigma$, $\tau$, and
$\gamma$, with or without decorations, range over finite sequences. SEQ denotes the set
of all finite sequences.
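
The notions of Definition 1 can be mirrored directly on Python lists. This is a minimal sketch, assuming that a finite sequence is represented as a list whose entries are natural numbers or the pause symbol `"#"`; the helper names are hypothetical.

```python
PAUSE = "#"  # pause symbol; all other entries are natural numbers


def content(sigma: list) -> set:
    """Set of natural numbers occurring in the sequence (pauses ignored)."""
    return {x for x in sigma if x != PAUSE}


def length(sigma: list) -> int:
    """Number of elements of the sequence, pauses included."""
    return len(sigma)


def initial(sigma: list, n: int) -> list:
    """Initial sequence sigma[n] of length n, defined for n <= |sigma|."""
    if n > len(sigma):
        raise ValueError("n must not exceed the length of the sequence")
    return sigma[:n]


sigma = [3, PAUSE, 1, 3]
print(content(sigma), length(sigma), initial(sigma, 2))
```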

Definition 2. [Gol67] (a) A text $T$ for a language $L$ is a mapping from $N$ into
$(N \cup \{\#\})$ such that $L$ is the set of natural numbers in the range of $T$.

(b) The content of a text $T$, denoted by content($T$), is the set of natural numbers
in the range of $T$; that is, the language which $T$ is a text for.

(c) $T[n]$ denotes the finite initial sequence of $T$ with length $n$.

We let $T$, with or without decorations, range over texts. We let $\mathcal{T}$ range over
sets of texts.

Definition 3. [Gol67] An inductive inference machine (IIM) is an algorithmic
device which computes a mapping from SEQ into $N$.

$M(T[n])$ is interpreted as the grammar (index for an accepting program)
conjectured by the machine $M$ on the initial sequence $T[n]$. We say that $M$
converges on $T$ to $i$ (written $M(T)\downarrow = i$) if $(\forall^\infty n)[M(T[n]) = i]$.

Let $M_0, M_1, \ldots$ denote a sequence of the IIMs, such that every class in
TxtEx is identified by at least one of the machines in the sequence [OSW86].

Gold [Gol67] introduced the following language learning criterion known as
TxtEx-identification.

Definition 4. [Gol67] (a) $M$ TxtEx-identifies a text $T$ just in case
$(\exists i \mid W_i = \mathrm{content}(T))(\forall^\infty n)[M(T[n]) = i]$.

(b) $M$ TxtEx-identifies an r.e. language $L$ (written $L \in$ TxtEx($M$)) just in
case $M$ TxtEx-identifies each text for $L$.

(c) $M$ TxtEx-identifies a class $\mathcal{L}$ of r.e. languages (written $\mathcal{L} \subseteq$ TxtEx($M$))
just in case $M$ TxtEx-identifies each language from $\mathcal{L}$.

(d) TxtEx $= \{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[\mathcal{L} \subseteq \mathrm{TxtEx}(M)]\}$.
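
To make convergence in the limit concrete, here is a small sketch of a learner for the class FIN. It rests on an assumption that is not part of the definitions above: a conjecture is represented by the finite set itself (a `frozenset`) rather than by a grammar index. The helper `last_mind_change` only inspects a finite prefix, so it illustrates convergence but cannot prove it.

```python
def learner_fin(prefix: list) -> frozenset:
    """Conjecture the content seen so far; "#" marks a pause in the text."""
    return frozenset(x for x in prefix if x != "#")


def last_mind_change(learner, text_prefix: list):
    """Return the final conjecture on this prefix and the last n at which
    the conjecture on text_prefix[:n] differed from the previous one."""
    conjectures = [learner(text_prefix[:n]) for n in range(len(text_prefix) + 1)]
    last = 0
    for n in range(1, len(conjectures)):
        if conjectures[n] != conjectures[n - 1]:
            last = n
    return conjectures[-1], last


# A prefix of a text for the finite language {2, 5}.
prefix = [2, "#", 5, 2, 5, "#", 2, 5]
print(last_mind_change(learner_fin, prefix))
```

On any text for a finite language, every element appears after finitely many steps, after which this learner never changes its conjecture again.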
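3 Identification of Unions of Languages

Definition 5. [SA00] Let $\mathcal{L} \subseteq \mathcal{E}$.

(a) The union language of $\mathcal{L}$, $L_{\mathcal{L}} = \bigcup_{L \in \mathcal{L}} L$.

(b) The class of at most $k$ unions of $\mathcal{L}$, $\mathcal{L}^k = \{L_{\mathcal{L}'} \mid \mathcal{L}' \subseteq \mathcal{L} \wedge \mathrm{card}(\mathcal{L}') \le k\}$.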
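
For a finite class of finite languages, the class of at most $k$ unions can be computed by brute force. The sketch below is illustrative only: languages are modelled as `frozenset`s, the function names are hypothetical, and only non-empty subclasses are enumerated (an assumption of this sketch, since the definition does not say how the empty subclass is treated).

```python
from itertools import combinations


def union_language(subclass) -> frozenset:
    """Union of all languages in a finite subclass."""
    out = set()
    for L in subclass:
        out |= L
    return frozenset(out)


def unions_up_to(k: int, cls) -> set:
    """All unions of between 1 and k languages drawn from the finite class cls."""
    langs = list(cls)
    return {
        union_language(sub)
        for r in range(1, k + 1)
        for sub in combinations(langs, r)
    }


singles = {frozenset({0}), frozenset({1}), frozenset({2})}
print(sorted(sorted(L) for L in unions_up_to(2, singles)))
```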

We now define an identification criterion for the learning of unions of languages.

Definition 6. Let $k \in N^+$ and $\mathcal{L} \subseteq \mathcal{E}$.

(a) $M$ U$^k$TxtEx-identifies $\mathcal{L}$ just in case $\mathcal{L}^k \subseteq$ TxtEx($M$).

(b) U$^k$TxtEx $= \{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[M \text{ U}^k\text{TxtEx-identifies } \mathcal{L}]\}$.

UTxtEx coincides with the definition of "identification of unions of languages"
in [Wri89, SA00].

Wright [Wri89, MSW91] showed a sufficient condition (finite elasticity) for
indexed families [Ang80] of recursive languages to be in U$^n$TxtEx for all $n$.
Shinohara and Arimura noted that this result does not apply to the unions of an
unbounded number of languages and provided a sufficient condition for U$^*$TxtEx
membership in [SA00].

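We now define an identification criterion where the learner must, furthermore,
individually identify each of the languages in the union.

Definition 7. Given $\mathcal{L} \subseteq \mathcal{E}$ where $\mathrm{card}(\mathcal{L}) < \infty$.

(a) We say a set of indices $\{x_1, x_2, \ldots, x_{\mathrm{card}(\mathcal{L})}\} \subseteq N$ is a representation index
set of $\mathcal{L}$ just in case $\{W_{x_1}, W_{x_2}, \ldots, W_{x_{\mathrm{card}(\mathcal{L})}}\} = \mathcal{L}$.

(b) Let $I_{\mathcal{L}} = \{I \mid I \text{ is a representation index set of } \mathcal{L}\}$.

(c) Let $\mathcal{I} = \{I \mid (\exists \mathcal{L} \subseteq \mathcal{E}, \mathrm{card}(\mathcal{L}) < \infty)[I \in I_{\mathcal{L}}]\}$.

Any representation index set $\{x_1, x_2, \ldots, x_{\mathrm{card}(\mathcal{L})}\}$ can be represented by
a natural number $k$ where $D_k = \{x_1, x_2, \ldots, x_{\mathrm{card}(\mathcal{L})}\}$. This representation is
implicit whenever the context requires such an interpretation.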

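Definition 8. Let $k \in N^+$ and $\mathcal{L} \subseteq \mathcal{E}$.

(a) $M$ DU$^k$TxtEx-identifies $\mathcal{L}$ just in case $(\forall \mathcal{L}' \subseteq \mathcal{L} \mid \mathrm{card}(\mathcal{L}') \le k)(\forall T$ for
$L_{\mathcal{L}'})[M(T)\downarrow \wedge M(T) \in I_{\mathcal{L}'}]$.

(b) DU$^k$TxtEx $= \{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[M \text{ DU}^k\text{TxtEx-identifies } \mathcal{L}]\}$.

The D in DUTxtEx stands for discernible. It is clear from the definitions
that DU$^1$TxtEx = U$^1$TxtEx = TxtEx.

Proposition 1. Given $\mathcal{L} \subseteq \mathcal{E}$. If there exist $\mathcal{L}', \mathcal{L}'' \subseteq \mathcal{L}$ with $\mathcal{L}' \neq \mathcal{L}''$ but $L_{\mathcal{L}'} =
L_{\mathcal{L}''}$, then $\mathcal{L} \notin$ DU$^k$TxtEx for $k = \max(\mathrm{card}(\mathcal{L}'), \mathrm{card}(\mathcal{L}''))$.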
Definition 9. Let $\mathcal{L} \subseteq \mathcal{E}$ and $k \in N^+$. The class of languages $\mathcal{L}^k$ is said to be
uniquely definable from $\mathcal{L}$ just in case $(\forall L \in \mathcal{L}^k)(\exists! \mathcal{L}' \subseteq \mathcal{L} \mid \mathrm{card}(\mathcal{L}') \le k)[L_{\mathcal{L}'} =
L]$.
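
Unique definability, and the obstruction described in Proposition 1, can be checked by brute force on small finite classes. The following is only a sketch under simplifying assumptions: languages are finite `frozenset`s of integers, only non-empty subclasses are considered, and the function names are hypothetical.

```python
from itertools import combinations


def decompositions(L: frozenset, cls, k: int) -> list:
    """All subclasses of cls with 1..k members whose union equals L."""
    langs = sorted(cls, key=sorted)
    return [
        set(sub)
        for r in range(1, k + 1)
        for sub in combinations(langs, r)
        if frozenset().union(*sub) == L
    ]


def uniquely_definable(cls, k: int) -> bool:
    """True iff every union of at most k languages has exactly one decomposition."""
    langs = list(cls)
    targets = {
        frozenset().union(*sub)
        for r in range(1, k + 1)
        for sub in combinations(langs, r)
    }
    return all(len(decompositions(L, cls, k)) == 1 for L in targets)


overlapping = {frozenset({0}), frozenset({1}), frozenset({0, 1})}
print(uniquely_definable(overlapping, 1), uniquely_definable(overlapping, 2))
```

In the overlapping class, the language {0, 1} arises both as a single member and as the union of {0} and {1}, which is exactly the kind of ambiguity Proposition 1 exploits.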

We now introduce an identification criterion in which the complications of
Proposition 1 are avoided. The learner is considered correct if it simply names any set of
(up to) $n$ languages in the class which make up the language of the input text.

Definition 10. Let $k \in N^+$ and $\mathcal{L} \subseteq \mathcal{E}$.

(a) $M$ WDU$^k$TxtEx-identifies $\mathcal{L}$ just in case $(\forall L \in \mathcal{L}^k)(\forall T$ for $L)[M(T)\downarrow \wedge
(\exists \mathcal{L}' \subseteq \mathcal{L} \mid \mathrm{card}(\mathcal{L}') \le k)[M(T) \in I_{\mathcal{L}'} \wedge T \text{ is a text for } L_{\mathcal{L}'}]]$.

(b) WDU$^k$TxtEx $= \{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[M \text{ WDU}^k\text{TxtEx-identifies } \mathcal{L}]\}$.

The W in WDUTxtEx stands for weak.

To avoid the complications of Proposition 1 we may also require the learner to name all possible
unions of languages in the class which result in the language of the input text.
We do not consider this alternative in this abstract.

4 Hierarchy Results 

We now describe a natural class of languages which give rise to our hierarchy 
results. 

Fix $n \in N^+$, $n \ge 2$. Let $v_1, v_2, \ldots, v_{n-1}$ be unit vectors along each axis of an
$(n-1)$-dimensional space. Let $G_n$ be a simplex with $n$ vertices, respectively at
$-v_1, v_1, v_2, \ldots, v_{n-1}$. (For $n = 2$, the vertices are at $v_1$ and $-v_1$.)

Let RATn be the set of all the points in an (n — l)-dimensional space with 
only rational valued coordinates, and let coderat„(.) be an effective bijective 
mapping from RATn to N. Let Tn = {X)r=i I ^ ~ 

{Gn + T\TGTn}. 

For each $G \in \mathcal{A}_n$, the polytope of $G$, denoted $P(G)$, can be defined as the
set of all points $X$ which satisfy $n$ linear inequalities $V_k \cdot X \le b_k$, $k = 1, 2, \ldots, n$,
where for each $k$, the coefficient $b_k$ and the vector $V_k$ can be obtained by solving
$n - 1$ linear equations (each formed by substituting in the equation a vertex of
$G$).$^1$

Let $Lang(G) = \{\mathrm{coderat}_n(X) \mid X \in P(G) \wedge X \in RAT_n\}$. Let
$TRANSIM_n = \{Lang(G) \mid G \in \mathcal{A}_n\}$.

We now give some properties of An (and hence TRANSIMn) which we shall 
use to demonstrate our hierarchy results. 

Claim (1). Given G G An with vertices at Ai,A 2 , . . . ,An. Let V = {Ai, A 2 , . . . , 
An}. Let each outward normal for the hyperplane formed by V—{Ai} be denoted 
17 . Let G' = G + T where T G Tn, then T- 1^< 0 ^ (P - {4J) n P(G') = 0. 

$^1$ Intuitively, the inequality for each $k$ represents a bounding hyperplane for the
polytope, where each vector $V_k$ is the outward normal for the bounding hyperplane.
Claim (2). Let $n \ge 2$. Given $G \in \mathcal{A}_n$ with vertices at $A_1, A_2, \ldots, A_n$. Let $C$ be a
point in P{G). For each i G N, 1 < i < n,let~fti = (1/| CAi |) CAi and Gi{S) = 
G + S Hi . There exists a collection of n simplexes Gi(e^), 02 ( 62 ), . . . , 
where each e' > 0 and n numbers ^ 1 ,^ 2 , ■ ■ ■ ,^n where 0 < < e' such that 

(V<5„ 0 < <5. < e*)[P(G.(5.)) C U”=i P{Gj{e'^))]. 

Proposition 2. $(\forall n \in N^+)$[DU$^n$TxtEx $-$ U$^{n+1}$TxtEx $\neq \emptyset$].

Proof. The case of $n = 1$ is shown by the class of languages $\{K\} \cup$ SINGLE. We
now show the case for $n \ge 2$.

Let PRIMES be the set of all the prime numbers and $p_1, p_2, \ldots$ be an
enumeration of PRIMES in ascending order. Let $\psi$ be a computable numbering for
which $(\forall i \in N)[W^{\psi}_{p_i} = W_i]$.

For each G G 2l„, let Xi (G) = T-v\, where T G T„ is such that G = G„ -I- T, 
and let Lg = {(0,x) | x G Lang{G)} U {(l,y) \ y G }, where h{a) 

is the denominator of a in reduced form. Clearly, is a recursive function. Let 
ExtTRANSIMn = {Lq \ G G A„}. Using Claim (1), it can be verified that 
ExtTRANSIMn G DU”TxtEx. 

Let yl C rl„ be a collection of n simplexes as in Claim (2). Without loss of 
generality, we require that the numbering ip has it that (VG G A) [M^/f(Xi(G)) 
= 0]. Let Ga G An and f G rat be such that (Vi5, 0 < <5 < ^) [P(Go -I- <5ui) C 
Claim (2), such Ga and ^ exists. Let Gb = Ga + fvi. Let 
A' = {Ga + avi |0<a<^ A aG rat}. Let £' = [Lc U IJceyi I G' G A'}. 

For each 2 G rat, Xi{Ga) < z < Xi{Gb) there exists a language in £' 
which differ from UGe4 by the set {(l,y) | y G Since there exists an 

m G N such that (Vp G PRIMES where p > m){3 I G N \ I is co-prime with 
p)[Xi{Ga) < I < Xi{Gb)], the set | 2 : G rat, Xi{Ga) < z < Xi{Gb)} 

includes all the r.e. languages. Thus if $\mathcal{L}'$ is in TxtEx, then the set of all the
r.e. languages would be in TxtEx. It follows that $\mathcal{L}'$ cannot be in TxtEx. Since
$\mathcal{L}' \subseteq ExtTRANSIM_n^{\,n+1}$, $ExtTRANSIM_n \notin$ U$^{n+1}$TxtEx. □

Corollary 1. For all $n \in N^+$:

(a) U$^{n+1}$TxtEx $\subset$ U$^n$TxtEx.

(b) DU$^{n+1}$TxtEx $\subset$ DU$^n$TxtEx.

(c) WDU$^{n+1}$TxtEx $\subset$ WDU$^n$TxtEx.

Proposition 3. For all $n \in N^+$:

(a) (WDU$^*$TxtEx $\cap$ DU$^n$TxtEx) $-$ DU$^{n+1}$TxtEx $\neq \emptyset$.

(b) (U$^*$TxtEx $\cap$ WDU$^n$TxtEx) $-$ WDU$^{n+1}$TxtEx $\neq \emptyset$.

Proof. For part (a), the case of $n = 1$ is shown by FIN. Let $n \in N$ and
$n \ge 2$. Let $\{G_1, G_2, \ldots, G_n\} \subseteq \mathcal{A}_n$ be a collection of $n$ simplexes and let
$G_0$ be such that $P(G_0) \subseteq \bigcup_{i=1}^{n} P(G_i)$. Such $G_0, G_1, \ldots, G_n$ exist by Claim
(2). Let $\mathcal{L} = \{Lang(G_i) \mid 0 \le i \le n\}$. It is easy to verify that $\mathcal{L} \in$
DU$^n$TxtEx $\cap$ WDU$^*$TxtEx. However, since a text for $\bigcup_{i=0}^{n} Lang(G_i)$ is also
a text for $\bigcup_{i=1}^{n} Lang(G_i)$, by Proposition 1, $\mathcal{L} \notin$ DU$^{n+1}$TxtEx.
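
The covering argument in part (a) is easy to visualise in the smallest case $n = 2$, where each simplex is an interval of length 2 on the rational line. The sketch below rests on several assumptions of its own: polytopes are taken to be closed intervals, a translate of $G_2$ by $t$ is represented as the pair $(t - 1, t + 1)$, and the function names are hypothetical.

```python
from fractions import Fraction


def polytope(t) -> tuple:
    """Closed interval [t - 1, t + 1]: the translate of G_2 = [-1, 1] by t."""
    t = Fraction(t)
    return (t - 1, t + 1)


def covered(target: tuple, pieces: list) -> bool:
    """Exact test of whether the closed interval target lies in the union of pieces."""
    lo, hi = target
    reach = None  # right end of the covered stretch that starts at lo
    for a, b in sorted(pieces):
        if reach is None:
            if a <= lo <= b:
                reach = b
            elif a > lo:
                return False  # no piece contains the left end point
        elif a <= reach:
            reach = max(reach, b)
        else:
            break  # gap between reach and the next piece
        if reach is not None and reach >= hi:
            return True
    return reach is not None and reach >= hi


G0 = polytope(0)
G1 = polytope(Fraction(-1, 2))
G2 = polytope(Fraction(1, 2))
print(covered(G0, [G1, G2]), covered(G0, [G1]))
```

Since $P(G_0)$ is contained in $P(G_1) \cup P(G_2)$ here, every text for the union of all three corresponding languages is also a text for the union of the last two, so no learner can tell the two index sets apart.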

For part (b), let n G ./V+. For i,k £ N, let Ai^k = {{\}l{n+ 1)J • (n + 1) + 
A ihk)) \ j £ N AO < j <n}Li {{i,x) | a; G A^}. 

Given total g : N ^ N and i £ N, let Li^g = Ai^g(^iy Let Cg = {Li,g \ i £ JV}- 
It is easy to verify that {£g j g : N s- iV} C U*TxtEx n WDU"TxtEx. Note 
that for all g, g , L[^n+i)*e+j^g LJj<n {(l I 4“ 1) ^ e ^ 

z < (n + 1) * (e + 1) and x £ N}. 

Now define g such that for all e, {L(^n+i)*e+j,g \ j ^ n} is not the set of 
languages to which Mg converges on {{i,x) \ {n+ 1) * e < i < (n + 1) * (e + 1) 
and X £ N}. Note that such g can be easily defined. 

Thus, Cg ^ WDU”+^TxtEx. | 

Corollary 2. $(\forall n \in N, n \ge 2)$[DU$^n$TxtEx $\subset$ WDU$^n$TxtEx $\subset$ U$^n$TxtEx].

4.1 Disjointness 

It may be argued that languages in a union only fail to be discerningly identifiable 
as a result of crucial information regarding one language being lost within the 
other languages; that is, when all the “important” members (such as Lang{G) 
in the proof of Proposition of one language are also members of some other 
languages in the union. It is natural to ask if disjointness would be a sufficient 
condition for unions of languages to be discerningly identifiable. The following 
result answers this in the negative. 

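Theorem 1. For all $n \in N^+$, there exists $\mathcal{L} \in$ DU$^n$TxtEx where

(a) $\emptyset \notin \mathcal{L}$,

(b) $(\forall L, L' \in \mathcal{L})[L \neq L' \Rightarrow L \cap L' = \emptyset]$,

such that $\mathcal{L} \notin$ U$^{n+1}$TxtEx.

Proof. Unless stated otherwise, let $e, i, j$, with or without decorations, range over
$N$, and $S$, with or without decorations, range over finite sets. For each IIM $M_e$,
we construct $S_e, L_e^0, L_e^1, \ldots, L_e^n$ where

$L_e^0 = \{\langle e,0,0 \rangle\} \cup \{\langle e,i,j \rangle \mid 1 \le i \le n, j \in S_e\}$

and for $1 \le i \le n$, $L_e^i$ satisfies the following two properties:

(1) $L_e^i = \{\langle e,i,j \rangle \mid j \in W_{\min(\{\pi_3(x) \mid x \in L_e^i\})}\}$

(2) $\min(\{\pi_3(x) \mid x \in L_e^i\}) > \max(S_e)$.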
Let $\mathcal{L} = \{L_e^0, L_e^1, \ldots, L_e^n \mid e \in N\}$. It is clear that for all distinct $L, L' \in \mathcal{L}$, $L \cap L' = \emptyset$.
We now show that $\mathcal{L} \in$ DU$^n$TxtEx. Let $g$ be a recursive function
such that for each $e \in N$,

$W_{g(e,0,S)} = \{\langle e,0,0 \rangle\} \cup \{\langle e,i,j \rangle \mid 1 \le i \le n \wedge j \in S\}$

and for each $i \in N^+$,

$W_{g(e,i,j)} = \{\langle e,i,k \rangle \mid k \in W_j\}$.

Now $\mathcal{L} \in$ DU$^n$TxtEx is witnessed by the following $M$.

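$M(T[m])$:

    $S \leftarrow \emptyset$.
    $A \leftarrow \{j \in N \mid (\exists w \in \mathrm{content}(T[m]))[\pi_1(w) = j]\}$.
    For each $e \in A$ do
        $B \leftarrow \mathrm{content}(T[m])$.
        If $\langle e,0,0 \rangle \in \mathrm{content}(T[m])$ then
            $C \leftarrow \{j \mid (\forall i, 1 \le i \le n)[\langle e,i,j \rangle \in \mathrm{content}(T[m])]\}$.
            $S \leftarrow S \cup \{g(e,0,C)\}$.
            $B \leftarrow B - W_{g(e,0,C)}$.
        For $i \leftarrow 1$ to $n$ do
            If there exists $j_0$ such that $\langle e,i,j_0 \rangle \in B$, then
                For the minimum such $j_0$, let $S \leftarrow S \cup \{g(e,i,j_0)\}$.
    Output $S$.

It is easy to verify that $M$ DU$^n$TxtEx-identifies $\mathcal{L}$.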
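
The learner $M$ above can be traced on finite data with a short sketch. Several representation choices here are assumptions made for illustration: data items are Python triples `(e, i, j)` instead of coded natural numbers, `"#"` is the pause symbol, and the index $g(e, i, j)$ is represented symbolically by the tuple `("g", e, i, j)` rather than computed.

```python
def learner(prefix: list, n: int) -> set:
    """One step of M on the finite prefix T[m], for unions of up to n languages."""
    data = {w for w in prefix if w != "#"}
    S = set()
    A = {w[0] for w in data}                      # first components seen so far
    for e in A:
        B = set(data)
        if (e, 0, 0) in data:
            # j belongs to C iff <e,i,j> has appeared for every i in 1..n
            C = frozenset(
                j for (e2, i, j) in data
                if e2 == e and i >= 1
                and all((e, i2, j) in data for i2 in range(1, n + 1))
            )
            S.add(("g", e, 0, C))
            # remove the part of the data explained by W_{g(e,0,C)}
            B -= {(e, 0, 0)} | {(e, i, j) for i in range(1, n + 1) for j in C}
        for i in range(1, n + 1):
            js = [j for (e2, i2, j) in B if e2 == e and i2 == i]
            if js:
                S.add(("g", e, i, min(js)))       # least unexplained j names L_e^i
    return S


prefix = [(5, 0, 0), (5, 1, 3), "#", (5, 2, 3), (5, 1, 7)]
print(learner(prefix, 2))
```

In this run the triples with third component 3 are attributed to $L_5^0$, while the leftover item (5, 1, 7) makes the learner conjecture $g(5, 1, 7)$ for $L_5^1$.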

We now show that $\mathcal{L} \notin$ U$^{n+1}$TxtEx. For each $M_e$, here is the construction to
show that $M_e$ does not U$^{n+1}$TxtEx-identify $\mathcal{L}$. By Kleene's Recursion Theorem
there exists an index $e'$ such that $W_{e'}$ may be defined in stages $s = 0, 1, 2, \ldots$, as
below. For each $s$, $W_{e'}^s$ denotes the finite portion of $W_{e'}$ enumerated just before
stage $s$.

Stage 0: Let $\sigma^1 = \langle e,0,0 \rangle \diamond \langle e,1,e' \rangle \diamond \langle e,2,e' \rangle \diamond \ldots \diamond \langle e,n,e' \rangle$. Let $W_{e'}^1 = \{e'\}$.
Go to stage 1.

Stage $s$: Search for $\tau$ where $\mathrm{content}(\tau) \subseteq \{\langle e,i,j \rangle \mid 1 \le i \le n \wedge j >
\max(W_{e'}^s)\}$ such that $M_e(\sigma^s) \neq M_e(\sigma^s \diamond \tau)$. If and when $\tau$ is found,
enumerate $\{j \mid (\exists i', 1 \le i' \le n)[\langle e,i',j \rangle \in \mathrm{content}(\tau)]\}$ into $W_{e'}$,
and let $\sigma^{s+1}$ be an extension of $\sigma^s \diamond \tau$ such that $\mathrm{content}(\sigma^{s+1}) =
\{\langle e,0,0 \rangle\} \cup \{\langle e,i,j \rangle \mid 1 \le i \le n \wedge j \in W_{e'} \text{ enumerated up to
now}\}$. Go to stage $s + 1$.

If the search for $\tau$ fails at any stage $s$, then let $L_e^0 = \mathrm{content}(\sigma^s)$, and let
$e'' > \max(W_{e'}^s)$ be such that $\min(W_{e''}) = e''$. For each $i \in N$, $1 \le i \le n$, let
$L_e^i = \{\langle e,i,j \rangle \mid j \in W_{e''}\}$. Since stage $s$ does not succeed, $M_e$ does not identify
at least one of $L_e^0$ and $L_e^0 \cup \bigcup_{i=1}^{n} L_e^i$. If the search is successful at all stages, let
$L_e^0 = \{\langle e,0,0 \rangle\}$ and for each $i$ let $L_e^i = \{\langle e,i,x \rangle \mid x \in W_{e'}\}$; then $M_e$ fails to
converge on the input $\bigcup_s \sigma^s$, a text for $L_e^0 \cup \bigcup_{i=1}^{n} L_e^i$. □

In obtaining the above result we have used a class of languages that is not
recursively enumerable. It remains to be seen if, for recursively enumerable classes
of languages, disjointness can be sufficient as a condition for U$^n$TxtEx
identification for any $n > 1$. The following, however, shows the contrary.

Example 1. Let 

Lxfl = {(a;,0)} U {(a;,y) | (Vz < y)[^x{z)l]} 

T _ ] {{x,y+l)}ii {x,y + l) 

x,v+i ^ otherwise 

Let C = {L^ i I G N}. Clearly, C G TxtEx. Now suppose there exists M 
such that G TxtEx(M), then (3cr | content{a) C L^p) [(Vt | content{T) C 
{{x,i) I i G N})[M{a) = M{t)]] tpa, G 7?.. The condition on the left hand side 
is E 2 to check. However, the set {x \ ipx is recursive} is not E 2 , a contradiction. 

We note that the class of languages in Example 1 is not 1-1 recursively
enumerable. As will be shown by our next result, for a 1-1 recursively enumerable
class of languages, disjointness is a sufficient condition for the class to be in
DU$^*$TxtEx.

5 Sufficient Conditions for DUTxtEx Identification 

5.1 Functions That Enumerate Distinguishing Elements 

Let recursively enumerable $\mathcal{L} \subseteq \mathcal{E}$ be given. Suppose for all $L \in \mathcal{L}$, there is an
effective procedure to enumerate an element which is uniquely in $L$, that is, no
other language in $\mathcal{L}$ contains this element. Can we then identify every collection
of languages drawn from $\mathcal{L}$? An answer is attempted in the following proposition.

Proposition 4. Let $\mathcal{L}$ be a 1-1 recursively enumerable class of languages as
witnessed by the computable numbering $\psi$. If there exists a limiting recursive
function $d$ and total recursive $F$ for which $d(i) = \lim_{t \to \infty} F(i,t)$ such that

(a) $(\forall i \in N)[d(i) \in W_i^\psi]$,

(b) $(\forall i,j \in N)[d(i) \in W_j^\psi \Rightarrow i = j]$, and

(c) $(\forall j \in N)[\mathrm{card}(\{F(i,t) \mid i,t \in N\} \cap W_j^\psi) < \infty]$.

Then $\mathcal{L} \in$ DU$^*$TxtEx.

Proof. Let recursive function $h$ witness that $\psi \le \varphi$. Define $M$ as follows.

M(T[m]) : 

5 ^ 0 . 

For t = 0 to m do 
If [$F(i, m) \in \mathrm{content}(T[m])$ and

<m A / < m A i' ^ f A F(i,m) G H 

Then S' ^ SU{h(i)}. 

Output S. 

Let $T$ be a text for $L = \bigcup_{i \in D} W_i^\psi$, a union of card($D$) languages from $\mathcal{L}$, and let
$A = \mathrm{range}(F) \cap L$. Since each language in $\mathcal{L}$ intersects with only finitely many
outputs of $F$, $\mathrm{card}(A) < \infty$. Intuitively, $A$ contains all the potential
"distinguishing elements" $M$ will encounter during the identification process. Since $D$
and $A$ are finite, there exists $n \in N$ so large that

(1) (yt > n) (Vz G D) [F(i,t) = d{i) A d{i) G content{T[t]) O Wf^\. 

(2) (Vn' > n)(Vx G A-{d{k) \ k G D})[{3j G N - D)[x G ^ < 

nWyfAxGWl^,nW^,^J] 

Clause (1) ensures that all $i \in D$ will eventually be output by $M$. Clause (2)
ensures that all programs $j \notin D$ which enumerate some element in $A$ are
excluded from consideration (note that every element in $A$ is enumerated by some
program in $D$).

Hence for all $n' > n$, $i \in D$ if and only if $i \in M(T[n'])$. It follows that $M$
DU$^*$TxtEx-identifies $\mathcal{L}$. □

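Corollary 3. Let $\mathcal{L}$ be a class of languages for which there exists a 1-1
numbering and

(a) $\emptyset \notin \mathcal{L}$,

(b) $(\forall L, L' \in \mathcal{L})[L \neq L' \Rightarrow L \cap L' = \emptyset]$.

Then $\mathcal{L} \in$ DU$^*$TxtEx.

Proof. Let $\mathcal{L} = \{W_i^\psi \mid i \in N\}$ where $\psi$ is a 1-1 numbering for $\mathcal{L}$. Let $F(i,t) =
\min(W_{i,t}^\psi)$ and let $d(i) = \lim_{t \to \infty} F(i,t)$. Clearly, (a) $(\forall i \in N)[d(i) \in W_i^\psi]$,
(b) $(\forall i,j \in N)[d(i) \in W_j^\psi \Rightarrow i = j]$, and (c) $(\forall j \in N)[\mathrm{card}(\{F(i,t) \mid i,t \in
N\} \cap W_j^\psi) < \infty]$. Thus $d$ fulfills all the conditions for Proposition 4. Hence
$\mathcal{L} \in$ DU$^*$TxtEx. □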
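
The idea behind Corollary 3 can be tried out on a toy class of pairwise disjoint languages. The sketch below is not the machine $M$ of Proposition 4: it is a simplified learner that is adequate only for the disjoint case, and everything concrete in it is an assumption made for illustration, namely the class $W_i = \{(i, x) \mid x \in N\}$, the approximation `approx(i, t)` standing in for $W_{i,t}^\psi$, and the use of Python tuples instead of coded pairs.

```python
def approx(i: int, t: int) -> set:
    """Hypothetical finite approximation W_{i,t} of W_i = {(i, x) | x in N}.

    It grows with t, and its minimum only stabilises once t >= 5.
    """
    lo = max(0, 5 - t)
    return {(i, x) for x in range(lo, 5 + t)}


def F(i: int, t: int):
    """F(i, t) = min(W_{i,t}); None when the approximation is still empty."""
    w = approx(i, t)
    return min(w) if w else None


def learner(prefix: list) -> set:
    """Output every index i <= m whose current guess F(i, m) occurs in the data."""
    m = len(prefix)
    content = {w for w in prefix if w != "#"}
    return {i for i in range(m + 1) if F(i, m) in content}


# A text for W_1 union W_3, interleaving both languages in increasing order.
text = [(i, x) for x in range(20) for i in (1, 3)]
print(learner(text[:2]), learner(text[:12]))
```

Because the languages are disjoint, a guess $F(i, m)$ taken from $W_i$ can occur in the text only if $W_i$ is part of the union, so the learner never outputs a wrong index and eventually outputs exactly the indices in the union.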

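Corollary 4. Let $\mathcal{L}$ be an indexed family of recursive languages where

(a) $\emptyset \notin \mathcal{L}$,

(b) $(\forall L, L' \in \mathcal{L})[L \neq L' \Rightarrow L \cap L' = \emptyset]$.

Then $\mathcal{L} \in$ DU$^*$TxtEx.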

In Proposition 4, some weaker conditions for (a) and (b) may not be sufficient,
even if we strengthen (c) to require that $d$ is a recursive function. For instance, if
we have only the following conditions (where the requirement for (b) is relaxed):

(a) $(\forall i \in N)[d(i) \in W_i^\psi]$,

(b) $(\forall i \in N)[\mathrm{card}(\{W_j^\psi \mid d(i) \in W_j^\psi\}) < \infty]$,

(c) $d$ is recursive,

then identifiability for $\mathcal{L}^2$ cannot be guaranteed, as the following example shows.

Example 2. Let

$L_0 = \{\langle 0,x \rangle \mid x \in N\} \cup \{\langle 1,x \rangle \mid x \in K\}$

$L_{i+1} = \begin{cases} \{\langle 0,i+1 \rangle\} \cup \{\langle 1,i \rangle\} \cup \{\langle 2,i \rangle\} & \text{if } i \in K \\ \{\langle 0,i+1 \rangle\} \cup \{\langle 1,i \rangle\} & \text{otherwise} \end{cases}$

Let $\mathcal{L} = \{L_i \mid i \in N\}$ and define $d$ such that $(\forall x \in N)[d(x) = \langle 0,x \rangle]$. It is easy
to verify that (a) $\mathcal{L}$ is 1-1 recursively enumerable, (b) $\mathcal{L} \in$ TxtEx, (c) $\mathcal{L}^2$ is
uniquely definable from $\mathcal{L}$, and (d) $d$ satisfies the weaker condition given above
for $\mathcal{L}$. However, for all $k \in N$, the language $\{\langle 0,i \rangle \mid i \in N\} \cup \{\langle 1,x \rangle \mid x \in K \cup \{k\}\}$
is in $\mathcal{L}^2$, hence $\mathcal{L}^2$ is unidentifiable.

Consider a similar weakening of these conditions in which, instead of a single unique
element, $d$ is required to name only a set of elements which is unique to each
language in the class:

(a) $(\forall i \in N)[d(i) \subseteq W_i^{\psi}]$,

(b) $(\forall i,j \in N)[d(i) \subseteq W_j^{\psi} \Leftrightarrow i = j]$,

(c) $d$ is recursive.

Such a function will also fail to guarantee that $\mathcal{L}^* \in$ TxtEx, as demonstrated
by the following example.

Example 3. Let

$L_0 = \{(0,0)\} \cup \{(1,x) \mid x \in N\}$

$L_1 = \{(1,1)\} \cup \{(0,x) \mid x \in N\} \cup \{(2,x) \mid x \in K\}$

$L_{i+2} = \begin{cases} \{(0,i+2)\} \cup \{(1,x) \mid x \in N\} \cup \{(2,i)\} \cup \{(3,i)\} & \text{if } i \in K \\ \{(0,i+2)\} \cup \{(1,x) \mid x \in N\} \cup \{(2,i)\} & \text{otherwise} \end{cases}$

Let $\mathcal{L} = \{L_i \mid i \in N\}$ and define $d$ such that $(\forall x \in N)[d(x) = \{(0,x), (1,x)\}]$.
It is easy to verify that $\mathcal{L}$ is a 1-1 recursively enumerable class of languages
in TxtEx where all the languages in $\mathcal{L}^*$ are uniquely definable from $\mathcal{L}$, and
that $d$ satisfies all the prescribed conditions for $\mathcal{L}$. However, for all $k \in N$,
$\{(0,i), (1,i) \mid i \in N\} \cup \{(2,x) \mid x \in K \cup \{k\}\}$ is in $\mathcal{L}^*$, thus $\mathcal{L}^*$ is not in TxtEx.

5.2 Restrictions on Structures of Languages 

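Proposition 5. Given $n \in N^+$. Let $\mathcal{L}$ be a class of languages such that

(a) every language in $\mathcal{L}^n$ is uniquely definable from $\mathcal{L}$,

(b) $(\forall L \in \mathcal{L})[\mathrm{card}(\{L' \in \mathcal{L} \mid L' \cap L \neq \emptyset\}) < \infty]$,

(c) there exists a computable numbering $\psi$ for $\mathcal{L}$ such that:

(1) $(\forall L \in \mathcal{L})[\mathrm{card}(L) = \infty \Rightarrow \mathrm{card}(\{i \mid W_i^{\psi} = L\}) = 1]$,

(2) $(\forall L \in \mathcal{L})[\mathrm{card}(L) < \infty \Rightarrow \mathrm{card}(\{i \mid W_i^{\psi} = L\}) < \infty]$.

Then $\mathcal{L} \in$ DU$^n$TxtEx.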

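Proof. We let $A$ and $B$, with or without decorations, range over FIN. Let $h$
witness that $\psi \le \varphi$.

$M(T[m])$:

Let $C^m = \{i \mid i \le m \wedge W^{\psi}_{i,m} \cap \mathrm{content}(T[m]) \neq \emptyset\}$.

Let $\mathit{Candidates}^m = \{S \subseteq C^m \mid \mathrm{card}(S) \le n\}$.

Let $s_0 = \max(\{s \mid (\exists S \in \mathit{Candidates}^m)[\bigcup_{i \in S} W^{\psi}_{i,s} \subseteq \mathrm{content}(T[m]) \wedge \bigcup_{i \in S} W^{\psi}_{i,m} \supseteq \mathrm{content}(T[s])]\})$.

Output $\{h(i) \mid i \in D_{k_0}\}$, where $k_0 = \min(\{k \mid D_k \in \mathit{Candidates}^m \wedge \bigcup_{i \in D_k} W^{\psi}_{i,s_0} \subseteq \mathrm{content}(T[m]) \wedge \bigcup_{i \in D_k} W^{\psi}_{i,m} \supseteq \mathrm{content}(T[s_0])\})$.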

Intuitively, $M$ outputs the seemingly best grammar set in $\mathit{Candidates}^m$ which
describes the input text. Let $T$ be a text for $L = \bigcup_{i \in B} W_i^{\psi}$, a union of $\mathrm{card}(B) \le n$
languages from $\mathcal{L}$. We divide $B$ into two groups, $B_1 = \{i \in B \mid W_i^{\psi}$ is finite$\}$
and $B_2 = \{i \in B \mid W_i^{\psi}$ is infinite$\}$. By the requirement of $\psi$, for each $i \in B_1$,
there exist only finitely many $j$ such that $W_j^{\psi} = W_i^{\psi}$, and for each $i \in B_2$, for all

j y^i, wf ^Wf Wet A = {A \ Wf = ^f}- Since C” is uniquely 

definable from C, the only sets of languages which are capable of generating L 
are {B 2 A \ A G -4}. Let Correctind = {B 2 U A | ^ G A}. 

Let $C = \{i \mid W_i^{\psi} \cap \mathrm{content}(T) \neq \emptyset\}$. Since each language in $\{W_i^{\psi} \mid i \in B\}$
intersects with only finitely many other languages in $\mathcal{L}$, $C$ is finite. It is easy to
verify that there exists $n_0 \in N$ such that for all $n' > n_0$, $C^{n'} = C^{n_0} = C$. Let
$n_1 \in N$, $n_1 > n_0$ be so large that

(Vi G C')[Wf is finite ^ Wf = Wf^^ AWf C content {T[m])] 

Let Candidates' = Candidates”" . Clearly, for all n' > no, Candidates” = 
Candidates” = Candidates' . Let ri 2 > n\ be so large that 

^[(3A G Candidates' — CorrectInd)[ (UieA 

2 content{T[n 2 ]))]] 

Let no > ri 2 be so large that 

[ IJ Wfu 2 +i ^ content{T[n 3 \) A |J Wf^^ A content{T[n 2 + 1])] 

ieB ieB 

Clearly, for all n' > no, {D G Candidates' \ UieD WCti 2 +i — content{T[n']) 
A A content{T[n 2 + 1])} = Correctind. Hence for all n' > no, M 

outputs $\min(\{k \mid D_k \in \mathit{CorrectInd}\})$. It follows that $M$ DU$^n$TxtEx-identifies
$\mathcal{L}$. □

Corollary 5. Fix $n \in N^+$. Let $\mathcal{L} = \{L_i \mid i \in N\}$ be a 1-1 recursively enumerable
class of languages where

(a) every language in $\mathcal{L}^n$ is uniquely definable from $\mathcal{L}$,

(b) $(\forall i \in N)[\mathrm{card}(\{j \mid L_i \cap L_j \neq \emptyset\}) < \infty]$.

Then $\mathcal{L} \in$ DU$^n$TxtEx.

The conditions in Proposition 5 are clearly not necessary, as is evident
from TRANSIM$_n$, which is 1-1 recursively enumerable although every language in the
class intersects with infinitely many other languages within the class.



6 Intrinsic Complexity 



The concept of intrinsic complexity [FKS95, JS96, KPSW99, JKW00, JK01] is an
attempt to describe the relative hardness of identifying a class of languages under
the requirement given by an identification criterion. The idea is to reduce the task
of $\mathcal{I}$-identifying a class of languages to the task of $\mathcal{J}$-identifying another class.
To be able to reduce the identification of $\mathcal{L}$ to that of identifying $\mathcal{L}'$ we should
be able to transform $\mathcal{I}$-admissible texts $T$ for languages in $\mathcal{L}$ to $\mathcal{J}$-admissible
texts $T'$ for languages in $\mathcal{L}'$ and further transform $\mathcal{J}$-admissible sequences for
$T'$ into $\mathcal{I}$-admissible sequences for $T$.

An enumeration operator (or just operator), $\Theta$, is an algorithmic mapping
from SEQ into SEQ such that for all $\sigma, \tau \in$ SEQ, if $\sigma \subseteq \tau$, then $\Theta(\sigma) \subseteq \Theta(\tau)$.
We further assume that for all texts $T$, $\lim_{n \to \infty} |\Theta(T[n])| = \infty$. By extension, we
think of $\Theta$ as also defining a mapping from texts to texts such that $\Theta(T) = \bigcup_n \Theta(T[n])$.
If for a language $L$, there exists an $L'$ such that, for each text $T$ for $L$, $\Theta(T)$ is
a text for $L'$, then we write $\Theta(L) = L'$.
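
The monotonicity requirement on operators can be made concrete with a toy sketch. Finite sequences are modelled as Python lists and $\sigma \subseteq \tau$ as the prefix relation; the operator `theta` below, which tags every element of the input, is a hypothetical example chosen only because it is obviously monotone and length-preserving.

```python
def is_prefix(sigma, tau):
    """True iff the finite sequence sigma is an initial segment of tau."""
    return len(sigma) <= len(tau) and tau[:len(sigma)] == sigma


def theta(sigma):
    """Hypothetical operator: map each element x of the input to the pair (0, x)."""
    return [(0, x) for x in sigma]


sigma = [4, 7]
tau = [4, 7, 1, 7]

# Extending the input only extends the output, as monotonicity demands.
monotone_here = is_prefix(theta(sigma), theta(tau))
```

Because `theta` produces one output element per input element, the output length grows without bound along any text, which matches the additional assumption made above.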

[JS96] distinguished between two kinds of reductions, called weak and strong
reductions. We consider only the former here.

We extend the definition of weak reduction as follows, so that instead of
just reducing the task of identifying every language in a class, $\mathcal{L}_1$ say, to tasks
of identifying languages in another class $\mathcal{L}_2$, we want to reduce the task of
identifying every language in $\mathcal{L}_1^n$ to tasks of identifying languages in $\mathcal{L}_2^m$, for
some $m, n \in N^+$.



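Definition 11. Let $\mathcal{L}_1, \mathcal{L}_2 \subseteq \mathcal{E}$ be given. Let $\mathcal{K}_1, \mathcal{K}_2 \in \{U, DU, WDU\}$ and
$n, m \in N^+$ be given. Let $\mathcal{T}_1 = \{T \mid T$ is a text for some $L \in \mathcal{L}_1^n\}$. Let $\mathcal{T}_2 = \{T \mid T$ is
a text for some $L \in \mathcal{L}_2^m\}$. We say that $\mathcal{L}_1$ is weakly reducible to $\mathcal{L}_2$ with respect to
$\mathcal{K}_1^n$TxtEx and $\mathcal{K}_2^m$TxtEx if and only if there
exist operators $\Theta$ and $\Xi$ such that for all $T \in \mathcal{T}_1$ and for all infinite sequences
of conjectures $G$ the following hold:

(a) $\Theta(T) \in \mathcal{T}_2$, and

(b) if $G$ is a $\mathcal{K}_2^m$TxtEx-admissible sequence for $\Theta(T)$, then $\Xi(G)$ is a $\mathcal{K}_1^n$TxtEx-admissible
sequence for $T$.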



We say that £1 £2 if and only if £1 



TxtEx 



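Definition 12. [JKW00] Let $\mathcal{I}$ be an identification criterion. Let $\mathcal{L} \subseteq \mathcal{E}$ be
given.

(a) If for all $\mathcal{L}' \in \mathcal{I}$, $\mathcal{L}' \le^{\mathcal{I}}_{weak} \mathcal{L}$, then $\mathcal{L}$ is $\le^{\mathcal{I}}_{weak}$-hard.

(b) If $\mathcal{L}$ is $\le^{\mathcal{I}}_{weak}$-hard and $\mathcal{L} \in \mathcal{I}$, then $\mathcal{L}$ is $\le^{\mathcal{I}}_{weak}$-complete.

Proposition 6. For all $n \in N^+$,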

(a) INIT is complete. 

(b) INIT is ^ -complete. 

(c) INIT is 



Proposition 7. (Vn G N,n> 2)[TRANSIMn is '^^^^'^-complete\.

Abstract. We consider a learning model in which each element of a
class of recursive functions is to be identified in the limit by a computable
strategy. Given gradually growing initial segments of the graph
putable strategy. Given gradually growing initial segments of the graph 
of a function, the learner is supposed to generate a sequence of hypothe- 
ses converging to a correct hypothesis. The term correct means that the 
hypothesis is an index of the function to be learned in a given numbering. 
Restriction of the basic definition of learning in the limit yields several 
inference criteria, which have already been compared with respect to 
their learning power. 

The scope of uniform learning is to synthesize appropriate identification 
strategies for infinitely many classes of recursive functions by a uniform 
method, i.e. a kind of meta-learning is considered. In this concept we can 
also compare the learning power of several inference criteria. If we fix a 
single numbering to be used as a hypothesis space for all classes of re- 
cursive functions, we obtain results similar to the non-uniform case. This 
hierarchy of inference criteria changes, if we admit different hypothesis 
spaces for different classes of functions. Interestingly, in uniform identi- 
fication most of the inference criteria can be separated by collections of 
finite classes of recursive functions. 



1 Introduction 

Inductive Inference is concerned with theoretical models simulating learning pro- 
cesses. A model of quite simple mathematical description is for example identi- 
fication of classes of recursive functions. This concept in general includes three 
main components: 

- a partial-recursive function $S$, also called strategy, simulating the learner,

- a class $U$ of total recursive functions which have to be identified by $S$,

- a partial-recursive numbering $\psi$, called hypothesis space, which enumerates
at least all functions in $U$.

In each step of the identification process $S$ is presented a finite subgraph of
some unknown arbitrary function $f$ contained in $U$; the strategy $S$ then returns
a hypothesis which is interpreted as an index of a function in the given
numbering $\psi$. It is the learner's job to eventually return a single correct hypothesis,
i.e. the sequence of outputs ought to converge to a $\psi$-number of $f$. This model,
called identification in the limit, was first analyzed by Gold in [?] and
gave rise to the investigation and comparison of several new learning models
("inference criteria") based on that principle. The common idea was to restrict
the definition of identifiability by means of additional, and in some way natural,
demands concerning the properties of the hypotheses. The corresponding
models have been compared with respect to the resulting identification power;
for some more background the reader is referred to [?], [?] and [?].

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 251-|26^ 2001.

This paper studies Inductive Inference on a meta-level. Considering collec- 
tions of infinitely many classes U of recursive functions we are looking for meta- 
learners synthesizing an appropriate strategy for each class U to be learned. For 
that purpose we agree on a method to describe a class U, because for the synthe- 
sis of a learner our meta-strategy should be given some description of U. That 
means we do not only try to solve a learning problem by an expert learner but 
to design a higher-level learner which constructs a method for solving a learning 
problem from a given description. Thus the meta-learner is able to simulate all
the expert learners.

Uniform learning of classes of recursive functions has already been studied by
Jantke in [?]. Unfortunately, his results are rather negative; he proves that there
is no strategy which - given any description of an arbitrary class U consisting 
of just a single recursive function - synthesizes a learner which identifies U with 
respect to a fixed hypothesis space. Even if we allow different hypothesis spaces 
for the different classes of recursive functions, no meta-learner is successful for 
all descriptions of finite classes (cf. [12]). Since in the non-uniform case finite 
classes can be identified easily with respect to any common inference criterion, 
these results might suggest that the model of uniform learning yields a concept 
the investigation of which is not worthwhile. As we will see, the results in this
paper allow a more optimistic point of view. Of course it is quite natural to 
consider the same inference criteria known from the non-uniform model also in 
our meta-level. The aim of this paper is to investigate whether the comparison 
of these criteria concerning the resulting identification power yields hierarchies 
analogous to those established in the classical context. In most cases we will see
that the classical separation results can be transferred to uniform learning. And
we can prove even more. If we consider uniform learning with respect to fixed 
hypothesis spaces, all separations of inference criteria can be achieved by collec- 
tions of finite classes of recursive functions. The resulting hierarchies correspond 
to the non-uniform case. If we drop the restrictions concerning the hypothesis 
spaces, we obtain slightly different results, although many of the criteria can 
still be separated by finite classes. So whereas finite classes are very simple re- 
garding their identifiability in Gold’s model, they are in most cases sufficient 
for the separation of inference criteria in uniform learning. Furthermore we con- 
clude that the hierarchies obtained are very much influenced by the choice of 
the hypothesis spaces. Now, since the hierarchies of inference criteria do not
collapse in our meta-level - even if we restrict ourselves to the choice of simple
learning problems - we conclude that the concept of uniform learning is neither 
trivial nor fruitless. Furthermore this paper corroborates the interpretation that 
our different inference criteria possess some really substantial specific properties, 
which yield separations of such a strong nature that they still hold for uniform 
learning of finite classes. 

In [12] the reader may also find positive results encouraging further research.
It is shown that the choice of descriptions for the classes U has more influence on 
the uniform identifiability than the classes themselves, i.e. many meta-strategies 
fail rather because of a bad description of the learning problem than because 
of the complexity of the problem. So it might be interesting to find out what 
kinds of descriptions are suitable for uniform learnability and whether they can 
be characterized by any specific properties. 

Further research on uniform identification has also been done in the context
of language learning, see for example [?], [?] and [?]. Because of its numerous
positive results, in particular the work of Baliga, Case and Jain [?] motivates
the investigation of meta-strategies.

2 Preliminaries 

Recursion theoretic terms used without explicit definition can be found in [?].

By $N$ we denote the set of all nonnegative integers, $N^*$ is the set of all finite
tuples over $N$; the variable $n$ always ranges over $N$. For fixed $n$, the notion
$N^n$ is used for the set of all $n$-tuples of integers. By implicit use of a bijective
computable function $\mathrm{cod} : N^* \to N$ we will identify any $\alpha \in N^*$ with its coding
$\mathrm{cod}(\alpha) \in N$. A statement is quantified with $\forall^\infty n$ in order to indicate that the
statement is fulfilled for all but finitely many $n$; quantifiers $\forall$ and $\exists$ are used in
the common way.

For any set $X$ the expression $\mathrm{card}\, X$ denotes the cardinality of $X$; $\wp X$ denotes
the set of all subsets of $X$. As a symbol for set inclusion we use $\subseteq$, proper inclusion
is indicated by $\subset$. Incomparability of sets is expressed by $\#$.

The set of all partial-recursive functions is denoted by $\mathcal{P}$, the set of total
recursive functions by $\mathcal{R}$. In order to refer to functions of a fixed number $n$ of
input variables, we sometimes add the superscript $n$ to these symbols. For any
$f \in \mathcal{P}$, $x \in N$ we write $f(x)\downarrow$ if $f$ is defined on input $x$; $f(x)\uparrow$ otherwise. If
$f \in \mathcal{P}$ and $n$ fulfill $f(0)\downarrow, \ldots, f(n)\downarrow$ we set $f[n] := \mathrm{cod}(f(0), \ldots, f(n))$, i.e. $f[n]$
corresponds to the initial segment of length $n + 1$ of $f$. Comparing $f, g \in \mathcal{P}$ we
write $f =_n g$ if $\{(x, f(x)) \mid x \le n, f(x)\downarrow\} = \{(x, g(x)) \mid x \le n, g(x)\downarrow\}$; otherwise
$f \neq_n g$. By the notion $f \subseteq g$ we indicate that $\{(x, f(x)) \mid x \in N, f(x)\downarrow\} \subseteq
\{(x, g(x)) \mid x \in N, g(x)\downarrow\}$ and use proper inclusion by analogy. But $f \in \mathcal{P}$ may
also be identified with the sequence $(f(n))_{n \in N}$, so we sometimes write $f = 0^\infty$
and the like. We often identify a tuple $\alpha \in N^*$ with the function $\alpha\uparrow^\infty$ implicitly.
By $\mathrm{rng}(f)$ we refer to the range $\{f(x) \mid x \in N, f(x)\downarrow\}$ of a function $f \in \mathcal{P}$.
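
The paper only assumes that some bijective computable coding $\mathrm{cod} : N^* \to N$ exists. The following sketch shows one possible choice, built from the Cantor pairing function; this particular construction and the helper names `pair`, `unpair`, `decode` and `initial_segment` are assumptions made for illustration, not the coding used by the author.

```python
from math import isqrt


def pair(a, b):
    """Cantor pairing: a bijection from N x N onto N."""
    return (a + b) * (a + b + 1) // 2 + b


def unpair(z):
    """Inverse of the Cantor pairing."""
    w = (isqrt(8 * z + 1) - 1) // 2
    b = z - w * (w + 1) // 2
    return w - b, b


def cod(alpha):
    """Code a finite tuple: () -> 0, (a, rest...) -> 1 + pair(a, cod(rest))."""
    code = 0
    for a in reversed(alpha):
        code = 1 + pair(a, code)
    return code


def decode(code):
    """Inverse of cod; terminates because each step strictly decreases the code."""
    alpha = []
    while code > 0:
        a, code = unpair(code - 1)
        alpha.append(a)
    return tuple(alpha)


def initial_segment(f, n):
    """f[n] := cod(f(0), ..., f(n)) for a function f defined on 0..n."""
    return cod(tuple(f(x) for x in range(n + 1)))
```

With such a coding a strategy can be viewed as a function on natural numbers, since every initial segment $f[n]$ is represented by a single integer.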

A function $\psi \in \mathcal{P}^{n+1}$ is used as a numbering for the set $\mathcal{P}_\psi := \{\psi_i \mid i \in N\}$,
where $\psi_i(x) := \psi(i, x)$ for all $i \in N$, $x \in N^n$ as usual. $i$ is called $\psi$-number of
the function $\psi_i$. In order to refer to the set of all total functions in $\mathcal{P}_\psi$ we use
the notion $\mathcal{R}_\psi$, i.e. $\mathcal{R}_\psi := \mathcal{P}_\psi \cap \mathcal{R}$. $\mathcal{R}_\psi$ is called the recursive core or "$\mathcal{R}$-core"
of $\mathcal{P}_\psi$. If $\psi \in \mathcal{P}^{n+2}$, every $b \in N$ corresponds to a numbering $\psi^b \in \mathcal{P}^{n+1}$ if we
define $\psi^b(i, x) := \psi(b, i, x)$ for all $i \in N$, $x \in N^n$. Again $i$ is a $\psi^b$-number for the
function $\psi^b_i$ defined in the common way.

Now we introduce our basic Inductive Inference criterion called identification
in the limit, which was first defined in [?]. It may be regarded as a fundamental
learning model from which we define further restrictive inference criteria (see
Definitions 2 and 3). The notation EX in Definition 1 abbreviates the term
"explanatory identification" which is also used to refer to learning in the limit.

Definition 1. Let $U \subseteq \mathcal{R}$, $\psi \in \mathcal{P}^2$. The class $U$ belongs to $EX_\psi$ and is called
identifiable in the limit wrt the hypothesis space $\psi$ iff there is a function $S \in \mathcal{P}$
(called strategy) such that for any $f \in U$:

1. $S(f[n])\downarrow$ for all $n \in N$ ($S(f[n])$ is called hypothesis on $f[n]$),

2. there is some $j \in N$ such that $\psi_j = f$ and $\forall^\infty n\ [S(f[n]) = j]$.

If $S$ is given, we also write $U \in EX_\psi(S)$. We set $EX := \bigcup_{\psi \in \mathcal{P}^2} EX_\psi$.
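
As a minimal illustration of Definition 1, the sketch below implements the classical "identification by enumeration" strategy for a toy hypothesis space. It assumes that every function in the numbering is total (so agreement can be checked), that initial segments are passed as plain lists instead of codes, and that the numbering is a finite Python list; the names `agrees`, `enumeration_strategy` and the sample numbering are hypothetical.

```python
def agrees(g, values):
    """True iff g(x) equals the x-th observed value for every observed x."""
    return all(g(x) == y for x, y in enumerate(values))


def enumeration_strategy(psi):
    """Return a strategy S that outputs the least index consistent with the data."""
    def S(values):
        for j, g in enumerate(psi):
            if agrees(g, values):
                return j
        raise ValueError("target function is not in the hypothesis space")
    return S


# Toy hypothesis space: psi_0 = 0^infinity, psi_1 = identity, psi_2 = squaring.
psi = [lambda x: 0, lambda x: x, lambda x: x * x]
S = enumeration_strategy(psi)


def target(x):
    return x * x


# Hypotheses on f[0], f[1], ..., f[4]; they converge to the index 2.
hypotheses = [S([target(x) for x in range(n + 1)]) for n in range(5)]
```

The sequence of hypotheses changes finitely often and then stays at a correct index, which is exactly what condition 2 of the definition demands.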

On any function $f \in U$ the strategy $S$ must generate a sequence of hypotheses
converging to a $\psi$-number of $f$. But a user reading the hypotheses generated by
$S$ up to a certain time will never know whether the current hypothesis is correct
or not, because he cannot decide whether the time of convergence has already been
reached. If there were a bound on the number of mind changes, he could at least
rely on the current hypothesis whenever the bound is reached. Learning with such
bounds was first studied in [3].

Definition 2. Assume $U \subseteq \mathcal{R}$, $\psi \in \mathcal{P}^2$, $m \in N$. $U$ belongs to $(EX_m)_\psi$ and is
called identifiable (in the limit) with no more than $m$ mind changes wrt $\psi$ iff
there exists an $S \in \mathcal{P}$ satisfying

1. $U \in EX_\psi(S)$ (where $S$ is additionally permitted to return the sign "?"),

2. for all $f \in U$ there is an $n_f \in N$ satisfying

   - $\forall x < n_f\ [S(f[x]) = {?}]$,

   - $\forall x \ge n_f\ [S(f[x]) \in N]$,

3. $\mathrm{card}\{n \in N \mid {?} \neq S(f[n]) \neq S(f[n+1])\} \le m$ for all $f \in U$.

We use the notations $(EX_m)_\psi(S)$ and $EX_m$ by analogy with Definition 1. A
class $U \subseteq \mathcal{R}$ is identifiable with a bounded number of mind changes iff there is
an $m \in N$ such that $U \in EX_m$.

The output “?” allows our strategy to indicate that its hypothesis is left open 
for the actual time being, in order not to waste a mind change in the beginning 
of the learning process. 
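
Condition 3 of Definition 2 can be checked on any finite prefix of a hypothesis sequence. The sketch below counts mind changes exactly as in that condition: a change from "?" to a first real hypothesis is not counted. The function name `mind_changes` and the sample sequences are hypothetical.

```python
def mind_changes(hypotheses):
    """Count positions n with '?' != S(f[n]) != S(f[n+1]) in a finite prefix."""
    return sum(
        1
        for current, following in zip(hypotheses, hypotheses[1:])
        if current != "?" and current != following
    )


# The learner waits with "?" and then changes its mind once (from 3 to 5).
delayed = ["?", "?", 3, 3, 5, 5]

# A learner without "?" that needs two changes before converging.
eager = [0, 1, 2, 2, 2]
```

Comparing the two sample sequences shows why the sign "?" is useful: postponing the first guess costs nothing with respect to the bound $m$.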

It is also a natural thought to strengthen the demands concerning the intermediate
hypotheses themselves. A successful learning behaviour might be to
generate intermediate hypotheses agreeing with the information received up to
the current time of the learning process ("consistent" hypotheses, cf. [?]). In order
to be less demanding, one could also ask for hypotheses which do not disagree
convergently (i.e. in their defined values) with the current information ("conform"
hypotheses, see [?]). Since any hypothesis representing a function not contained
in $\mathcal{R}$ must be wrong, another natural demand would be to allow only $\psi$-numbers
of total recursive functions ("total" hypotheses, cf. [?]) as outputs of $S$. Since
in general the halting problem in $\psi$ is not decidable, it might be hard for our
strategy to detect the incorrectness of a hypothesis, if the corresponding function
differs from the function to be learned only by being undefined for some
arguments. For learning with "convergently incorrect" hypotheses (cf. [?]) such
outputs are forbidden.

Definition 3. Choose a pair $(I, C_I)$ from those listed below. Let $U \subseteq \mathcal{R}$, $\psi \in \mathcal{P}^2$.
$U$ is called identifiable under the criterion $I$ wrt $\psi$ iff there exists a strategy $S \in \mathcal{P}$
such that $U \in EX_\psi(S)$ and for all $f \in U$, $n \in N$ condition $C_I$ is satisfied.

I        C_I
CONS     $\psi_{S(f[n])} =_n f$
CONF     $\forall x \le n\ [\psi_{S(f[n])}(x)\downarrow \Rightarrow \psi_{S(f[n])}(x) = f(x)]$
TOTAL    $\psi_{S(f[n])} \in \mathcal{R}$
CEX      $\psi_{S(f[n])} \not\subset f$

We use the phrases "identification with consistent (conform, total, convergently
incorrect) intermediate hypotheses" respectively. The notations $I_\psi$, $I_\psi(S)$ and $I$ are
used in the common way.
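
The difference between the conditions for CONS and CONF can be seen in a toy model. The sketch below represents a partial function by a Python dict whose missing keys stand for undefined arguments; this is an assumption made for illustration only, since for genuine partial-recursive functions definedness cannot be decided. The names `is_consistent` and `is_conform` are hypothetical.

```python
def is_consistent(hypothesis, values):
    """CONS: the hypothesis is defined and correct on every x <= n."""
    return all(x in hypothesis and hypothesis[x] == y for x, y in enumerate(values))


def is_conform(hypothesis, values):
    """CONF: wherever the hypothesis is defined on x <= n, it is correct."""
    return all(hypothesis[x] == y for x, y in enumerate(values) if x in hypothesis)


observed = [0, 1, 4]                  # f(0), f(1), f(2) for f(x) = x * x

gappy = {0: 0, 2: 4}                  # undefined at 1, correct elsewhere
full = {0: 0, 1: 1, 2: 4, 3: 9}       # defined and correct on 0..2
wrong = {0: 0, 1: 5}                  # defined at 1 with a wrong value
```

Every consistent hypothesis is conform, while `gappy` shows that the converse fails; this mirrors the inclusion of CONS in CONF.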

For the inference criteria introduced here the following comparison results 
have been proved: 

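Theorem 1. 1. $\forall m \in N\ [EX_m \subset EX_{m+1} \subset EX]$ (see [?]).

2. $TOTAL \subset CONS \subset CONF \subset EX \subset \wp\mathcal{R}$ (see [?], [?] and [?]).

3. $TOTAL \subset CEX \subset EX$ (see definitions and [?]).

4. $CEX \mathrel{\#} CONS$ (see [?]).

For convenience $\mathcal{I} := \{EX, CONS, CONF, TOTAL, CEX\} \cup \{EX_m \mid m \in N\}$
denotes the set of all inference criteria declared in this section. Furthermore let

- $J^* := \{U \subseteq \mathcal{R} \mid U$ is finite$\}$,

- $J^1 := \{U \subseteq \mathcal{R} \mid \mathrm{card}\, U = 1\}$ $(= \{\{f\} \mid f \in \mathcal{R}\})$.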

3 Uniform Learning — Definition and Basic Results 

From now on let $\varphi \in \mathcal{P}^3$ be a fixed acceptable numbering of $\mathcal{P}^2$ and $\tau \in \mathcal{P}^2$
an acceptable numbering of $\mathcal{P}^1$. As $\varphi$ is acceptable, it might be regarded as a
numbering of all numberings $\psi \in \mathcal{P}^2$; every $b \in N$ corresponds to the function $\varphi^b$
which is defined by $\varphi^b(i, x) := \varphi(b, i, x)$ for any $i, x \in N$. Thus $b$ also describes
a class $\mathcal{R}_b$ of recursive functions, where $\mathcal{R}_b := \mathcal{R}_{\varphi^b} = \mathcal{P}_{\varphi^b} \cap \mathcal{R}$; i.e. $\mathcal{R}_b$ is the
recursive core of $\mathcal{P}_{\varphi^b}$. Therefore any set $B \subseteq N$ will be called description set
for the collection $\{\mathcal{R}_b \mid b \in B\}$ of recursive cores corresponding to the indices
in $B$. Considering each recursive core as a set of functions to be identified, any
description set $B \subseteq N$ may be associated to a collection of learning problems.
Now we are looking for a meta-learner which, given any description $b \in B$,
develops a special learner coping with the learning problem described by $b$,
i.e. the special learner must identify each function in $\mathcal{R}_b$.

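Definition 4. Let $J \subseteq \wp\mathcal{R}$, $I \in \mathcal{I}$, $J \subseteq I$, $B \subseteq N$. The set $B$ is called suitable
for uniform learning wrt $J$ and $I$ iff the following conditions are fulfilled:

1. $\forall b \in B\ [\mathcal{R}_b \in J]$,

2. $\exists S \in \mathcal{P}^2\ \forall b \in B\ \exists \psi \in \mathcal{P}^2\ [\mathcal{R}_b \in I_\psi(\lambda x.S(b, x))]$.

We abbreviate this by $B \in \mathrm{suit}(J, I)$ and write $B \in \mathrm{suit}(J, I)(S)$, if $S$ is given.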
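
The expression $\lambda x.S(b, x)$ in condition 2 means that fixing the description $b$ turns the meta-strategy into an ordinary learner. The sketch below mimics this with `functools.partial`. It is a toy under strong assumptions: descriptions index a small table of finite lists of total Python functions, whereas a real numbering $\varphi^b$ may contain partial functions, for which this search would not be computable. All names are hypothetical.

```python
from functools import partial


def meta_strategy(descriptions, b, values):
    """Toy meta-learner: least index in the numbering described by b that fits the data."""
    numbering = descriptions[b]
    for i, g in enumerate(numbering):
        if all(g(x) == y for x, y in enumerate(values)):
            return i
    return 0  # fallback guess when no listed function fits


# Hypothetical description table: each b describes a finite class of total functions.
descriptions = {
    0: [lambda x: 1, lambda x: x + 1],
    1: [lambda x: 2 * x],
}

# Fixing b yields the special learner for the class described by b.
learner_for_0 = partial(meta_strategy, descriptions, 0)
learner_for_1 = partial(meta_strategy, descriptions, 1)
```

Here the hypotheses are interpreted in the numbering given by the description itself, which corresponds to the restricted setting introduced next rather than to a freely chosen hypothesis space.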

So $B \in \mathrm{suit}(J, I)$ iff every recursive core described by some index $b \in B$
belongs to the class $J$ and additionally there is a strategy $S \in \mathcal{P}^2$ which, given
$b \in B$, synthesizes an $I$-learner successful for $\mathcal{R}_b$ with respect to some appropriate
hypothesis space $\psi$. Note that the synthesis of these appropriate hypothesis
spaces is not required. This means in particular that in general the output of a
meta-learner cannot be interpreted practically, because we might not know which
numbering is actually used as a hypothesis space. Of course we might restrict
our definition of suitable description sets by demanding uniform learnability with
respect to the acceptable numbering $\tau$ for all classes $\mathcal{R}_b$. Another possibility is
to use the numberings $\varphi^b$, $b \in B$, already given by the description set $B$ as
hypothesis spaces for $I$-identification of the classes $\mathcal{R}_b$.

Definition 5. Let $I \in \mathcal{I}$, $J \subseteq I$, $B \subseteq N$, $S \in \mathcal{P}^2$. Assume $B \in \mathrm{suit}(J, I)(S)$.
We write $B \in \mathrm{suit}_\tau(J, I)(S)$ if $\mathcal{R}_b \in I_\tau(\lambda x.S(b, x))$ for all $b \in B$. The notation
$B \in \mathrm{suit}_\varphi(J, I)(S)$ shall indicate that $\mathcal{R}_b \in I_{\varphi^b}(\lambda x.S(b, x))$ for all $b \in B$. We
also use the notations $\mathrm{suit}_\tau(J, I)$ and $\mathrm{suit}_\varphi(J, I)$ in the usual way.

Of course it would be nice to find characterizations of the sets suitable for
uniform learning with respect to $J$, $I$, where $I \in \mathcal{I}$ and $J \subseteq I$ are given. This
paper compares the uniform identification power of several criteria $I \in \mathcal{I}$ and
concentrates on the case $J = J^*$, i.e. all recursive cores to be identified are finite.
Our first result follows obviously from our definitions.

Proposition 1. Let $I \in \mathcal{I}$, $J \subseteq I$. Then $\mathrm{suit}_\varphi(J, I) \subseteq \mathrm{suit}_\tau(J, I) \subseteq \mathrm{suit}(J, I)$.

Whether these inclusions are proper inclusions or not depends on the choice
of $J$ and $I$. If they turned out to be equalities for all $J$ and $I$, then Definition
5 would be superfluous. But in fact, as Theorem 5 will show, we have proper
inclusions in the general case. That means that a restriction in the choice of the
hypothesis spaces results in a restriction of the learning power of meta-strategies.

Any strategy identifying a class $U \subseteq \mathcal{R}$ with respect to some criterion
$I \in \mathcal{I} \setminus \{CONS, CONF\}$ can be replaced by a total recursive strategy without
loss of learning power. This new strategy is defined by computing the values of
the old strategy for a bounded number of steps and a bounded number of input
examples with increasing bounds. As long as no hypothesis is found, some temporary
hypothesis agreeing with the restrictions in the definition of $I$ is produced.
Afterwards the hypotheses of the former strategy are put out "with delay".^1
Now we transfer these observations to the level of uniform learning and get the
following result, which we will use in several proofs:

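Proposition 2. Let $I \in \mathcal{I} \setminus \{CONS, CONF\}$, $J \subseteq I$, $B \subseteq N$. Assume $B \in
\mathrm{suit}(J, I)$ ($\mathrm{suit}_\tau(J, I)$). Then there is a total recursive function $S$ such that $B \in
\mathrm{suit}(J, I)(S)$ ($\mathrm{suit}_\tau(J, I)(S)$ respectively).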
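
The "delay" technique behind Proposition 2 can be sketched as follows. The sketch assumes that the old, possibly partial strategy is available through a step-bounded simulator `bounded_run(segment, steps)` that returns a hypothesis if the computation halts within `steps` steps and `None` otherwise; this interface, the function `totalize` and the sample strategy `slow_run` are hypothetical and represent only one way to realize the idea described above.

```python
def totalize(bounded_run, default):
    """Build a total strategy from a step-bounded simulation of a partial one."""
    def total_strategy(values):
        n = len(values) - 1
        # Try the longest initial segments first, always with the step bound n.
        for k in range(n, -1, -1):
            hypothesis = bounded_run(values[:k + 1], n)
            if hypothesis is not None:
                return hypothesis
        return default  # temporary hypothesis while nothing has halted yet
    return total_strategy


def slow_run(segment, steps):
    """Hypothetical old strategy: guesses the last value seen, but needs
    2 * len(segment) steps to halt on a given segment."""
    if steps < 2 * len(segment):
        return None
    return segment[-1]


T = totalize(slow_run, default=0)

# On the constant function 7 the total strategy first emits the temporary
# hypothesis and then repeats the delayed outputs of the old strategy.
outputs = [T([7] * (n + 1)) for n in range(5)]
```

Since the segment actually simulated grows without bound as $n$ grows, the new strategy converges to the same hypothesis as the old one. The temporary value `default` would have to be chosen so that it respects the criterion $I$ in question.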

Let us now collect some simple examples of description sets suitable or not 
suitable for uniform learning. First we consider the identification of classes con- 
sisting of just one recursive function. Any set describing such classes turns out 
to be suitable for identification under any of our criteria: 

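Theorem 2. Let $I \in \mathcal{I}$. Then $\mathrm{suit}(J^1, I) = \{B \subseteq N \mid \mathcal{R}_b \in J^1$ for all $b \in B\}$.

Proof. Let $B \subseteq N$ fulfill $\mathcal{R}_b \in J^1$ for all $b \in B$. Since for all $f \in \mathcal{R}$ there
exists a numbering $\psi \in \mathcal{P}^2$ with $\psi_0 = f$, the strategy constantly zero yields
$B \in \mathrm{suit}(J^1, I)$. Thus $\{B \subseteq N \mid \mathcal{R}_b \in J^1$ for all $b \in B\} \subseteq \mathrm{suit}(J^1, I)$. The other
inclusion is obvious. □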

Unfortunately, we would rather not regard the strategy defined in this proof 
as an “intelligent” learner, because its output does not depend on the input at 
all. Its success lies just in the choice of appropriate hypothesis spaces. If such a 
choice of hypothesis spaces is forbidden, we obtain an absolutely negative result: 

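Theorem 3. $\{b \in N \mid \mathcal{R}_b \in J^1\} \notin \mathrm{suit}_\tau(J^1, EX)$.

In particular even $\{b \in N \mid \mathrm{card}\{i \in N \mid \varphi^b_i \in \mathcal{R}\} = 1\} \notin \mathrm{suit}_\tau(J^1, EX)$.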

For a proof see [?] or [?]. So, if we fix our hypothesis spaces in advance,
not even the classes consisting of just one element can be identified in the limit
uniformly. Regarding the identification of arbitrary finite classes (the learnability
of which is trivial in the non-uniform case), the situation gets worse still. Even by
free choice of the hypothesis spaces we cannot achieve uniform EX-identifiability.

Theorem 4. $\{b \in N \mid \mathcal{R}_b \in J^*\} \notin \mathrm{suit}(J^*, EX)$.

A proof can be found in [?]. How can we interpret these results? Is the concept
of uniform learning fruitless and further research in this area not worthwhile?
Fortunately, many results in [5] and [?] allow a more optimistic point of view.
For example, [12] shows that some constraints on the descriptions $b \in B$, especially
concerning the topological structure of the numberings, yield uniform
learnability of huge classes of functions, even with consistent and total intermediate
hypotheses and also with respect to our acceptable numbering $\tau$. The sticking
point seems to be that uniform identifiability is not so much influenced by the
classes to be learned, but by the numberings $\varphi^b$ chosen as representations for
these classes. So the numerous negative results should be interpreted carefully.
For example the reason that there is no uniform EX-learner for $\{b \in N \mid \mathcal{R}_b \in J^*\}$
is not so much the complexity of finite classes but rather the need to cope with
any numbering possessing a finite $\mathcal{R}$-core. Based on these aspects we should not
tend to a pessimistic view concerning the fruitfulness of the concept of uniform
learning. Our results in the following sections will substantiate this opinion.

^1 This does not work for CONS and CONF, since in general after the delay the
hypotheses are no longer consistent or conform with the information at the current time
of the learning process.

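Theorems 2 and 3 now enable the proof of the following example of a strict
version of Proposition 1.

Theorem 5. $\mathrm{suit}_\varphi(J^1, I) \subset \mathrm{suit}_\tau(J^1, I) \subset \mathrm{suit}(J^1, I)$ for all $I \in \mathcal{I}$.

Proof. $\mathrm{suit}_\tau(J^1, I) \subset \mathrm{suit}(J^1, I)$ is obtained as follows: by Theorem 3 we know
that $B_1 := \{b \in N \mid \mathcal{R}_b \in J^1\} \notin \mathrm{suit}_\tau(J^1, I)$ (otherwise $B_1$ would also be an element
of $\mathrm{suit}_\tau(J^1, EX)$). Thus by Theorem 2 we obtain $B_1 \in \mathrm{suit}(J^1, I) \setminus \mathrm{suit}_\tau(J^1, I)$.

It remains to prove $\mathrm{suit}_\varphi(J^1, I) \subset \mathrm{suit}_\tau(J^1, I)$. Again by Theorem 3 we know
that there exists a set $B \subseteq N$ such that $\mathrm{card}\{i \in N \mid \varphi^b_i \in \mathcal{R}\} = 1$ for all $b \in B$
and $B \notin \mathrm{suit}_\varphi(J^1, EX)$. Now let $g \in \mathcal{R}$ be a computable function satisfying

$\varphi^{g(b)}_i(x) = \begin{cases} 0 & \text{if } \varphi^b_i(y)\downarrow \text{ for all } y \le x \\ \uparrow & \text{otherwise} \end{cases}$

for any $b, i, x \in N$.

Let $B' := \{g(b) \mid b \in B\}$. Since $\mathcal{R}_{g(b)} = \{0^\infty\}$ for $b \in B$, we get $B' \in \mathrm{suit}_\tau(J^1, I)$
(via a strategy which constantly returns a $\tau$-index of the function $0^\infty$).

Obviously $\{i \in N \mid \varphi^{g(b)}_i \in \mathcal{R}\} = \{i \in N \mid \varphi^b_i \in \mathcal{R}\}$ for all $b \in N$. If
there were a strategy $S \in \mathcal{P}^2$ satisfying $B' \in \mathrm{suit}_\varphi(J^1, I)(S)$, we would achieve
$B \in \mathrm{suit}_\varphi(J^1, EX)(T)$ by defining $T(b, f[n]) := S(g(b), 0^n)$ for $f \in \mathcal{R}$, $b, n \in N$.
This contradicts the choice of $B$, so $B' \in \mathrm{suit}_\tau(J^1, I) \setminus \mathrm{suit}_\varphi(J^1, I)$. Hence
$\mathrm{suit}_\varphi(J^1, I) \subset \mathrm{suit}_\tau(J^1, I)$. □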



4 Separation of Inference Criteria — Special Hypothesis 
Spaces 

From now on we will compare the learning power of our inference criteria for
uniform learning of finite classes of recursive functions, i.e. we try to find results
in the style of Theorem 1 where the criteria $I \in \mathcal{I}$ are replaced by the
sets $\mathrm{suit}(J^*, I)$, $\mathrm{suit}_\tau(J^*, I)$ or $\mathrm{suit}_\varphi(J^*, I)$. Please note that a separation like
for example $\mathrm{suit}_\tau(CONS, CONS) \subset \mathrm{suit}_\tau(CONS, EX)$ is not a very astonishing
result. The remarkable point is that even collections of finite classes of recursive
functions suffice for a separation (note that in the non-uniform case finite classes
can be identified under any criterion $I \in \mathcal{I}$ easily).

Since all proofs for the theorems stated in Section 4 proceed in a similar
manner and include rather long constructions, we will omit most of them and
just give sketches of the proofs for Theorem 7 and Theorem [?].

In this section we concentrate on uniform learning with respect to fixed
hypothesis spaces, i.e. according to Definition 5. Our aim is to show that all the
comparison results in Theorem 1 hold analogously for these concepts, even if all
classes to be learned are finite. Lemma 1 summarizes some simple observations.

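Lemma 1. 1. $\mathrm{suit}_\varphi(J^*, EX_m) \subseteq \mathrm{suit}_\varphi(J^*, EX_{m+1}) \subseteq \mathrm{suit}_\varphi(J^*, EX)$ for arbitrary
$m \in N$,

2. $\mathrm{suit}_\varphi(J^*, TOTAL) \subseteq \mathrm{suit}_\varphi(J^*, CONS) \subseteq \mathrm{suit}_\varphi(J^*, CONF) \subseteq \mathrm{suit}_\varphi(J^*, EX)$,

3. $\mathrm{suit}_\varphi(J^*, TOTAL) \subseteq \mathrm{suit}_\varphi(J^*, CEX) \subseteq \mathrm{suit}_\varphi(J^*, EX)$.

These results hold analogously if we substitute $\mathrm{suit}_\varphi$ by $\mathrm{suit}_\tau$.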

Proof. All these inclusions except for $\mathrm{suit}_\varphi(J^*, TOTAL) \subseteq \mathrm{suit}_\varphi(J^*, CONS)$ (or
analogously with $\tau$ instead of $\varphi$) follow immediately from the definitions. If a
set $B \subseteq N$ fulfills $B \in \mathrm{suit}_\varphi(J^*, TOTAL)(S)$ for some strategy $S \in \mathcal{P}^2$, we can
easily define $T \in \mathcal{P}^2$ such that $B \in \mathrm{suit}_\varphi(J^*, CONS)(T)$. On input $(b, f[n])$ the
strategy $T$ just has to check the hypothesis $S(b, f[n])$ for consistency wrt $\varphi^b$.
For $b \in B$, $f \in \mathcal{R}_b$ this is possible, because the function $\varphi^b_{S(b, f[n])}$ is total. If
consistency is verified, $T$ returns the same index as $S$, otherwise it returns some
consistent hypothesis (which can be found, if $f \in \mathcal{R}_b$). Convergence to a correct
hypothesis follows from the choice of $S$. The $\tau$-case is proved by analogy. □
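
The construction of $T$ from $S$ in this proof can be sketched for a single fixed description. The sketch assumes a toy numbering given as a finite list of total Python functions and a learner `S` whose hypotheses are always indices of total functions; the names `make_consistent` and `S`, and the particular numbering, are hypothetical. The fallback search stands in for "some consistent hypothesis (which can be found)".

```python
def agrees(g, values):
    """True iff the total function g matches all observed values."""
    return all(g(x) == y for x, y in enumerate(values))


def make_consistent(S, numbering):
    """Wrap a learner with total hypotheses so that every output is consistent."""
    def T(values):
        h = S(values)
        if agrees(numbering[h], values):   # decidable because numbering[h] is total
            return h
        for j, g in enumerate(numbering):  # otherwise search for a consistent index
            if agrees(g, values):
                return j
        raise ValueError("no consistent hypothesis in the numbering")
    return T


numbering = [lambda x: 0, lambda x: x, lambda x: x * x]


def S(values):
    """Hypothetical learner for x * x: correct from the fourth example on."""
    return 2 if len(values) >= 4 else 0


T = make_consistent(S, numbering)
segments = [[x * x for x in range(n + 1)] for n in range(4)]
consistent_hypotheses = [T(segment) for segment in segments]
```

Once `S` has converged to a correct index, that index is consistent with every later segment, so `T` simply passes it through and converges to the same hypothesis.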

Now we want to prove that all these inclusions are in fact proper inclusions.
For that purpose consider Theorem 6 first.

Theorem 6. $\mathrm{suit}_\varphi(J^*, EX_{m+1}) \setminus \mathrm{suit}(J^*, EX_m) \neq \emptyset$ for any $m \in N$.^2

Note that this result is even stronger than required. We just needed to prove
$\mathrm{suit}_\varphi(J^*, EX_{m+1}) \setminus \mathrm{suit}_\varphi(J^*, EX_m) \neq \emptyset$ and the corresponding statement for the
$\tau$-case. Besides we have not only verified $\mathrm{suit}(J^*, EX_{m+1}) \setminus \mathrm{suit}(J^*, EX_m) \neq \emptyset$,
but we observe a further fact: though we know uniform learning with respect
to the hypothesis spaces given by $\varphi$ to be much more restrictive than uniform
learning without special demands concerning the hypothesis spaces, we still can
find collections of class-descriptions which are

- restrictive enough to describe finite classes of recursive functions only,

- suitable for uniform $EX_{m+1}$-identification with respect to the hypothesis
spaces corresponding to their descriptions,

- but not suitable for uniform $EX_m$-identification even if the hypothesis spaces
can be chosen without restrictions.

Similar strict separations are obtained by the following theorems. 

Theorem 7. $\mathrm{suit}_\varphi(J^*, EX) \setminus \mathrm{suit}(J^*, CONF) \neq \emptyset$.



¹ The proof is omitted but proceeds similarly to the proof of Theorem 7.

Proof. We will just give a sketch of the relevant parts of the proof; details and
formal constructions are not needed to explain the general idea common to most
of the proofs of our results. We use a strategy $T \in \mathcal{R}$ to define a description
set $B \subseteq \mathbb{N}$ suitable for uniform identification in the limit by $T$. The set $B$ shall
describe only finite recursive cores and will not be suitable for uniform conform
identification. The choice of the strategy $T$ may seem rather arbitrary, but it
will enable an indirect proof.

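Define $T \in \mathcal{R}$ by

$$T(f[n]) := \begin{cases} \ldots \\ \max\{f(0), \ldots, f(n)\} - 1 & \text{otherwise} \end{cases}$$

for arbitrary $f \in \mathcal{R}$ and $n \in \mathbb{N}$.

Then set $B := \{b \in \mathbb{N} \mid \mathcal{R}_b \text{ is finite and } \mathcal{R}_b \in EX_{\varphi^b}(T)\}$.

We will prove $B \in \mathrm{suit}_\varphi(J^*, EX) \setminus \mathrm{suit}(J^*, CONF)$. By definition of $B$ we
obviously have $B \in \mathrm{suit}_\varphi(J^*, EX)$. Now $B \notin \mathrm{suit}(J^*, CONF)$ is verified by way
of contradiction.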



Assumption. $B \in \mathrm{suit}(J^*, CONF)$.

Then there is some $S \in \mathcal{P}^2$ such that $\mathcal{R}_b \in CONF(\lambda x.S(b,x))$ for all $b \in B$.

Aim. Construct an integer $b_0$ such that $b_0 \in B$, but $\mathcal{R}_{b_0} \notin CONF(\lambda x.S(b_0,x))$,
in contradiction to our assumption. The strategy $\lambda x.S(b_0,x)$ will fail for at least
one function $f \in \mathcal{R}_{b_0}$ by either

- changing its hypothesis for $f$ infinitely often or

- not terminating its computation on input of some initial segment of $f$ or

- violating the conformity demand on input of some initial segment of $f$.

Construction of $b_0$. We define $\eta^b \in \mathcal{P}^2$ uniformly in $b \in \mathbb{N}$. First we define
$\eta^b_0(0) := 0$. If we set $y_0 := 0$, the segment $\eta^b_0[y_0]$ is already defined. We start in
stage 0.

In general, in stage $k$ we proceed as follows:

For the definition of further values of $\eta^b_0$ one computes $S(b, \eta^b_0[y_k])$, $S(b, \eta^b_0[y_k]0)$
and $S(b, \eta^b_0[y_k]1)$. If one of these values is undefined, then $\eta^b_0 = \eta^b_0[y_k]\uparrow^\infty$. Else, if
these values are all equal, we append zeros until we observe that the strategy
$\lambda x.S(b,x)$ changes its mind on the initial segment constructed so far. Otherwise
we just append one value $t \in \{0, 1\}$ such that $S(b, \eta^b_0[y_k]) \neq S(b, \eta^b_0[y_k]t)$.

The functions $\eta^b_{2k+1}$, $\eta^b_{2k+2}$ are defined as follows:

$\eta^b_{2k+1}[y_k + 2] := \eta^b_0[y_k]0(2k+2)$, $\eta^b_{2k+2}[y_k + 2] := \eta^b_0[y_k]1(2k+3)$. Both
functions will be extended by zeros until the values $S(b, \eta^b_0[y_k]0)$ and $S(b, \eta^b_0[y_k]1)$
are computed and the definition of $\eta^b_0$ is stopped temporarily because of a
mind change of $\lambda x.S(b,x)$ on the initial segment of $\eta^b_0$ constructed so far (if
these conditions are never satisfied, we obtain $\eta^b_{2k+1} = \eta^b_0[y_k]0(2k+2)0^\infty$ and
$\eta^b_{2k+2} = \eta^b_0[y_k]1(2k+3)0^\infty$). If the definition of $\eta^b_0$ is stopped temporarily, let $y_{k+1}$
be the maximal argument for which $\eta^b_0$ is defined. If $y_{k+1}$ exists, go to stage $k+1$.

The Recursion Theorem then yields an integer $b_0 \in \mathbb{N}$ satisfying $\varphi^{b_0} = \eta^{b_0}$.

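Claim. 1. We have $\mathrm{rng}(\varphi^{b_0}_0) \subseteq \{0, 1\}$; if $x \in \mathbb{N}$, then $\max(\mathrm{rng}(\varphi^{b_0}_{x+1})) = x + 2$
or $\mathrm{rng}(\varphi^{b_0}_{x+1}) = \emptyset$.

2. If in the construction of $\varphi^{b_0}_0$ all stages are reached, then $\mathcal{R}_{b_0} = \{\varphi^{b_0}_0\}$. If stage
$k$ ($k \in \mathbb{N}$) is the last stage to be reached, then $\mathcal{R}_{b_0} = \{\varphi^{b_0}_0, \varphi^{b_0}_{2k+1}, \varphi^{b_0}_{2k+2}\}$
or $\mathcal{R}_{b_0} = \{\varphi^{b_0}_{2k+1}, \varphi^{b_0}_{2k+2}\}$.

This claim implies $b_0 \in B$. For the proof of $\mathcal{R}_{b_0} \notin CONF(\lambda x.S(b_0,x))$ we
assume by way of contradiction that $\mathcal{R}_{b_0} \in CONF_\psi(\lambda x.S(b_0,x))$ for some
numbering $\psi \in \mathcal{P}^2$. By Claim 2 it suffices to consider the following three cases: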

Case 1. $\mathcal{R}_{b_0} = \{\varphi^{b_0}_0\}$.

Then all stages are reached in our construction. We observe that in the
identification process for $\varphi^{b_0}_0$ the strategy $\lambda x.S(b_0,x)$ changes its hypothesis
infinitely often.

Case 2. $\mathcal{R}_{b_0} = \{\varphi^{b_0}_{2k+1}, \varphi^{b_0}_{2k+2}\}$ for some $k \in \mathbb{N}$.

In this case we have $S(b_0, \varphi^{b_0}_{2k+1}[y_k+1])\uparrow$ or $S(b_0, \varphi^{b_0}_{2k+2}[y_k+1])\uparrow$ (with $y_k$ as in
our construction), so $\lambda x.S(b_0,x)$ cannot be successful for both $\varphi^{b_0}_{2k+1}$ and $\varphi^{b_0}_{2k+2}$.

Case 3. $\mathcal{R}_{b_0} = \{\varphi^{b_0}_0, \varphi^{b_0}_{2k+1}, \varphi^{b_0}_{2k+2}\}$ for some $k \in \mathbb{N}$.

Then stage $k$ is reached; stage $k+1$ is not reached. Furthermore

$S(b_0, \varphi^{b_0}_{2k+1}[y_k+1]) = S(b_0, \eta^{b_0}_0[y_k]0) = S(b_0, \eta^{b_0}_0[y_k]1) = S(b_0, \varphi^{b_0}_{2k+2}[y_k+1])$,

although $\varphi^{b_0}_{2k+1}[y_k+1] \neq \varphi^{b_0}_{2k+2}[y_k+1]$. Thus $i := S(b_0, \varphi^{b_0}_{2k+1}[y_k+1])$ cannot
be a $\psi$-number for both $\varphi^{b_0}_{2k+1}$ and $\varphi^{b_0}_{2k+2}$. There are two possibilities:

Case 3.1. $\psi_i(y_k+1)\uparrow$.

Then the sequence of hypotheses produced by $\lambda x.S(b_0,x)$ on the function $\varphi^{b_0}_0$
converges to an index incorrect for $\varphi^{b_0}_0$ with respect to $\psi$.

Case 3.2. $\psi_i(y_k+1)\downarrow$.

Then $i$ is not conform for both $\varphi^{b_0}_{2k+1}[y_k+1]$ and $\varphi^{b_0}_{2k+2}[y_k+1]$ with respect to $\psi$.

We conclude $\mathcal{R}_{b_0} \notin CONF_\psi(\lambda x.S(b_0,x))$; thus $\mathcal{R}_{b_0} \notin CONF(\lambda x.S(b_0,x))$.
The properties of $b_0$ now contradict our assumption, so $B \notin \mathrm{suit}(J^*, CONF)$.
This completes the proof. □

A separation of the criteria CONF and CONS in the uniform learning model
can be verified with similar methods; the proof is omitted.

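Theorem 8. $\mathrm{suit}_\varphi(J^*, CONF) \setminus \mathrm{suit}(J^*, CONS) \neq \emptyset$.

So, the results $CONS \subset CONF \subset EX$ can also be transferred to uniform
learning with respect to $\tau$ and the numberings given a priori by $\varphi$. Again, finite
classes are sufficient for the separations.

In order to prove $\mathrm{suit}_\varphi(J^*, TOTAL) \subset \mathrm{suit}_\varphi(J^*, CONS)$ (and the same result
for $\mathrm{suit}_\tau$) we use Theorem 9. Since $\mathrm{suit}_\tau(J^*, TOTAL) \subseteq \mathrm{suit}_\tau(J^*, CEX)$, we even
obtain $\mathrm{suit}_\varphi(J^*, CONS) \setminus \mathrm{suit}_\tau(J^*, TOTAL) \neq \emptyset$.

Theorem 9. $\mathrm{suit}_\varphi(J^*, CONS) \setminus \mathrm{suit}_\tau(J^*, CEX) \neq \emptyset$.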

Proof. We will omit some formal details and concentrate on the main ideas.
Again we use a strategy $T \in \mathcal{P}^2$ to define a description set $B \subseteq \mathbb{N}$ suitable for
uniform consistent identification by $T$. Though $B$ describes only finite recursive
cores, it will not be suitable for uniform CEX-identification with respect to $\tau$.



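for $f \in \mathcal{R}$ and $b, n \in \mathbb{N}$.

Then set $B := \{b \in \mathbb{N} \mid \mathcal{R}_b \text{ is finite and } \mathcal{R}_b \in CONS_{\varphi^b}(\lambda x.T(b,x))\}$.

We will prove $B \in \mathrm{suit}_\varphi(J^*, CONS) \setminus \mathrm{suit}_\tau(J^*, CEX)$. The definitions imply
$B \in \mathrm{suit}_\varphi(J^*, CONS)$. The claim $B \notin \mathrm{suit}_\tau(J^*, CEX)$ is verified by way of
contradiction.

Assumption. $B \in \mathrm{suit}_\tau(J^*, CEX)$,

i.e. there is some $S \in \mathcal{P}^2$ such that $\mathcal{R}_b \in CEX_\tau(\lambda x.S(b,x))$ for any $b \in B$.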

Aim. Construction of an integer $b_0 \in B$ with $\mathcal{R}_{b_0} \notin CEX_\tau(\lambda x.S(b_0,x))$, in
contradiction to our assumption. The strategy $\lambda x.S(b_0,x)$ will fail for at least
one $f \in \mathcal{R}_{b_0}$ by either

- changing its hypothesis for $f$ infinitely often or

- generating a hypothesis incorrect for $f$ with respect to $\tau$ for infinitely many
initial segments of $f$ or

- guessing a $\tau$-number of a proper subfunction of $f$ on input of some initial
segment of $f$.



Define $T \in \mathcal{P}^2$ by

$$T(b, f[n]) := \begin{cases} \min\{\ldots [\alpha 0 \subseteq \ldots \text{ and } \alpha 0 \subseteq f]\} & \text{if such a minimum is found,} \\ \ldots & \text{otherwise} \end{cases}$$

Construction of $b_0$.

Define a function $\psi \in \mathcal{P}^3$ with the help of initial segments $\alpha^b_k$ ($b, k \in \mathbb{N}$) as
follows: for arbitrary $b \in \mathbb{N}$ set $\alpha^b_0 := 1$ and begin in stage 0.

In general, in stage $k$ we proceed as follows:
$e := S(b, \alpha^b_k)$. Start a parallel check until (i) or (ii) turns out to be true.

(i) There is some $y < |\alpha^b_k|$ such that $\tau_e(y)$ is defined and $\tau_e(y) \neq \alpha^b_k(y)$.

(ii) There is some $y \geq |\alpha^b_k|$ such that $\tau_e(y)$ is defined.

The function $\psi^b_{k+1}$ shall have the initial segment $\alpha^b_k 0$ which will be extended
by a sequence of 0's, until (i) or (ii) turns out to be true. If condition (i) turns out
to be true first, then $\psi^b_0$ shall have the initial segment $\alpha^b_k$ which will be extended
by a sequence of 1's, until $\lambda x.S(b,x)$ is forced to change its mind on $\psi^b_0$; then
$\alpha^b_{k+1}$ shall be the initial segment of $\psi^b_0$ constructed so far. If condition (ii) turns
out to be true first, with $\tau_e(y_k)\downarrow$ and $y_k \geq |\alpha^b_k|$, then $\alpha^b_{k+1} := \alpha^b_k 1 \ldots 1(\tau_e(y_k) + 1)$,
where the last argument in the domain of $\alpha^b_{k+1}$ is $y_k$. In case $\alpha^b_{k+1}$ is defined go
to stage $k+1$. If neither (i) nor (ii) is fulfilled, $\psi^b_0$ remains initial.

The Recursion Theorem then yields an integer $b_0 \in \mathbb{N}$ satisfying $\varphi^{b_0} = \psi^{b_0}$.

Claim. The construction in stage k implies 

1. G iff [a\^ is defined and C (C 

2. if ^ TZ and then (pQ° = G TZ and the sequence of hy- 

potheses produced by Xx.S{bQ,x) on (^g° converges to an index incorrect for 
(pg“ with respect to r, 

3. if <fk°+i i ^ furthermore 

a) S'(5o,afe“+i) ^ S{bo,al°) or 

b) S{bo, /[|o!^‘’| — 1]) is incorrect wrt r for any f GTZ satisfying C /, 

4. if <pg° G TZ, then ^ TZ for all fc G N. Furthermore 0 ^ rng((pg°). 

5. There is exactly one index i such that G TZ. 

With this claim and our construction we can verify $b_0 \in B$. Now we assume
by way of contradiction that $\mathcal{R}_{b_0} \in CEX_\tau(\lambda x.S(b_0,x))$. It suffices to regard two
cases.

Case 1. $\mathcal{R}_{b_0} = \{\varphi^{b_0}_0\}$.

Then on $\varphi^{b_0}_0$ the strategy $\lambda x.S(b_0,x)$ changes its hypothesis infinitely often or
returns a hypothesis incorrect with respect to $\tau$ infinitely often. We obtain
$\mathcal{R}_{b_0} \notin CEX_\tau(\lambda x.S(b_0,x))$.

Case 2. $\mathcal{R}_{b_0} = \{\varphi^{b_0}_i\}$ with $i \geq 1$.

With Claim U we have = (p')° for some n G N. 

Hence $S(b_0, \varphi^{b_0}_i[n])$ is a $\tau$-number of a proper subfunction of $\varphi^{b_0}_i$. We conclude
$\mathcal{R}_{b_0} \notin CEX_\tau(\lambda x.S(b_0,x))$.

In each case we have $\mathcal{R}_{b_0} \notin CEX_\tau(\lambda x.S(b_0,x))$. As $b_0 \in B$, this contradicts
our initial assumption; so $B \notin \mathrm{suit}_\tau(J^*, CEX)$. This completes the proof. □

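Thus it only remains to show that the inclusions in Lemma 1.3 are proper
inclusions. From Theorem 9 and $\mathrm{suit}_\varphi(J^*, CONS) \subseteq \mathrm{suit}_\varphi(J^*, EX)$ (analogously
for $\mathrm{suit}_\tau$) we obtain that the second inclusion $\mathrm{suit}_\varphi(J^*, CEX) \subseteq \mathrm{suit}_\varphi(J^*, EX)$
and its $\tau$-version are indeed proper. For the first inclusion regard Theorem 10.

Theorem 10. $\mathrm{suit}_\varphi(J^*, CEX) \setminus \mathrm{suit}(J^*, CONS) \neq \emptyset$.²

Together with $\mathrm{suit}_\tau(J^*, TOTAL) \subseteq \mathrm{suit}_\tau(J^*, CONS)$ this theorem yields
$\mathrm{suit}_\varphi(J^*, CEX) \setminus \mathrm{suit}_\tau(J^*, TOTAL) \neq \emptyset$ and in particular $\mathrm{suit}_\varphi(J^*, TOTAL) \subset
\mathrm{suit}_\varphi(J^*, CEX)$, where again $\mathrm{suit}_\varphi$ may be replaced by $\mathrm{suit}_\tau$.

With Theorems 9 and 10 we have also verified the following corollary.

Corollary 1. 1. $\mathrm{suit}_\varphi(J^*, CEX) \mathrel{\#} \mathrm{suit}_\varphi(J^*, CONS)$,

2. $\mathrm{suit}_\tau(J^*, CEX) \mathrel{\#} \mathrm{suit}_\tau(J^*, CONS)$.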

Now we can summarize our separation results for uniform learning of finite 
classes with respect to fixed hypothesis spaces. 

Theorem 11. 1. $\mathrm{suit}_\varphi(J^*, EX_m) \subset \mathrm{suit}_\varphi(J^*, EX_{m+1}) \subset \mathrm{suit}_\varphi(J^*, EX)$ for
arbitrary $m \in \mathbb{N}$,

2. $\mathrm{suit}_\varphi(J^*, TOTAL) \subset \mathrm{suit}_\varphi(J^*, CONS) \subset \mathrm{suit}_\varphi(J^*, CONF) \subset \mathrm{suit}_\varphi(J^*, EX)$,

3. $\mathrm{suit}_\varphi(J^*, TOTAL) \subset \mathrm{suit}_\varphi(J^*, CEX) \subset \mathrm{suit}_\varphi(J^*, EX)$.

These results hold analogously if we substitute $\mathrm{suit}_\varphi$ by $\mathrm{suit}_\tau$.

Thus we have transferred the comparison results of Theorem 1 to the concept
of meta-learning in fixed hypothesis spaces. Each separation is achieved already
by restricting ourselves to the synthesis of strategies for finite classes of recursive
functions.

5 Separation of Inference Criteria — General Hypothesis 
Spaces 

In this section we investigate the hierarchies of inference criteria for uniform 
learning without restrictions in the choice of the hypothesis spaces. Again we will 
concentrate on description sets corresponding to collections of finite classes of 
recursive functions. Some of the comparison results in Section 4 hold analogously 
for this concept, but there are differences, too. Our first simple observations in
Lemma 2 follow immediately from the definitions.

Lemma 2. 1. $\mathrm{suit}(J^*, EX_m) \subseteq \mathrm{suit}(J^*, EX_{m+1}) \subseteq \mathrm{suit}(J^*, EX)$ for all $m \in \mathbb{N}$,

2. $\mathrm{suit}(J^*, CONS) \subseteq \mathrm{suit}(J^*, CONF) \subseteq \mathrm{suit}(J^*, EX)$,

3. $\mathrm{suit}(J^*, TOTAL) \subseteq \mathrm{suit}(J^*, CEX) \subseteq \mathrm{suit}(J^*, EX)$.



² The proof is omitted but proceeds similarly to the proof of Theorem 9.

Note that we dropped the inclusion for TOTAL-identification in the second
line. Since in general a uniform strategy $S$ satisfying $B \in \mathrm{suit}(J^*, TOTAL)(S)$
for some $B \subseteq \mathbb{N}$ cannot synthesize an appropriate hypothesis space for $\mathcal{R}_b$ from
$b \in B$, the hypotheses returned by $S$ cannot be checked for consistency. Therefore
the proof of Lemma 1 cannot be transferred. By Theorems 6, 7, and 8 all inclusions
in Lemma 2.1 and 2.2 are proper inclusions. But for the other separations we
observe a different connection, as Theorem 12 states.

Theorem 12. $\mathrm{suit}(J^*, TOTAL) = \mathrm{suit}(J^*, CEX) = \mathrm{suit}(J^*, EX)$.



Proof. $\mathrm{suit}(J^*, TOTAL) \subseteq \mathrm{suit}(J^*, CEX) \subseteq \mathrm{suit}(J^*, EX)$ follows by definition.
It remains to prove $\mathrm{suit}(J^*, EX) \subseteq \mathrm{suit}(J^*, TOTAL)$. For that purpose fix a
description set $B \in \mathrm{suit}(J^*, EX)$. Then we know

1. $\mathcal{R}_b$ is finite for all $b \in B$,

2. there is a strategy $S \in \mathcal{P}^2$ such that for any $b \in B$ there is a hypothesis
space $\psi^{[b]} \in \mathcal{P}^2$ satisfying $\mathcal{R}_b \in EX_{\psi^{[b]}}(\lambda x.S(b,x))$.

Note that the hypothesis spaces $\psi^{[b]}$ do not have to be computable uniformly
in $b$. Now we want to prove that $B \in \mathrm{suit}(J^*, TOTAL)$. We even will see that
our given strategy $S$ is already an appropriate strategy for uniform
TOTAL-identification from $B$. This requires a change of the hypothesis spaces $\psi^{[b]}$ for
$b \in B$.

Idea. Assume $b \in B$ is fixed. Since $\lambda x.S(b,x)$ identifies the finite class $\mathcal{R}_b$ in
the limit, there are only finitely many initial segments of functions in $\mathcal{R}_b$ which
force the strategy $\lambda x.S(b,x)$ into a "non-total" guess. If we replace the functions
in $\psi^{[b]}$ associated with these non-total guesses by an element of $\mathcal{R}$ (for example
$0^\infty$), we obtain a hypothesis space appropriate for TOTAL-identification of $\mathcal{R}_b$
by $\lambda x.S(b,x)$.

More formally: Fix $b \in B$. From 2 we obtain $\mathrm{card}\{n \in \mathbb{N} \mid \psi^{[b]}_{S(b,f[n])} \notin \mathcal{R}\} < \infty$
for all $f \in \mathcal{R}_b$. Defining the set of "forbidden" hypotheses on "relevant" initial
segments by

$H^{[b]} := \{i \in \mathbb{N} \mid \psi^{[b]}_i \notin \mathcal{R} \wedge \exists f \in \mathcal{R}_b\ \exists n \in \mathbb{N}\ [S(b, f[n]) = i]\}$,

we conclude with statement 1 that $H^{[b]}$ is finite. Now we define a new hypothesis
space $\eta^{[b]}$ by

$$\eta^{[b]}_i := \begin{cases} \psi^{[b]}_i & \text{if } i \notin H^{[b]}, \\ 0^\infty & \text{if } i \in H^{[b]} \end{cases}$$

for all $i \in \mathbb{N}$.

Since $\psi^{[b]} \in \mathcal{P}^2$ and $H^{[b]}$ is finite, $\eta^{[b]}$ is computable. The definition of $\eta^{[b]}$
then implies $\mathcal{R}_b \in TOTAL_{\eta^{[b]}}(\lambda x.S(b,x))$. As $b \in B$ was chosen arbitrarily, we
conclude $B \in \mathrm{suit}(J^*, TOTAL)$. □
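
The patching step of this proof, replacing finitely many "forbidden" hypotheses by the zero function, can be sketched as follows. This is an illustrative toy model only: the finite set of forbidden indices is assumed to be given (the proof merely shows that it exists and is finite), partial functions are modelled by returning `None` for "undefined", and the names `patch_space`, `psi` and `forbidden` are hypothetical.

```python
from typing import Callable, Optional, Set

PartialFn = Callable[[int], Optional[int]]  # None models "undefined"


def zero(_x: int) -> int:
    """The everywhere-zero function, standing in for 0^infinity."""
    return 0


def patch_space(psi: Callable[[int], PartialFn],
                forbidden: Set[int]) -> Callable[[int], PartialFn]:
    """Return eta with eta_i = psi_i for i outside `forbidden`, and the zero function otherwise."""
    def eta(i: int) -> PartialFn:
        return zero if i in forbidden else psi(i)
    return eta


# Hypothetical hypothesis space: index 2 names a non-total function.
def psi(i: int) -> PartialFn:
    if i == 2:
        return lambda x: x if x <= 3 else None
    return lambda x: i * x


eta = patch_space(psi, {2})
```

Because only finitely many indices are changed, the patched space remains computable whenever the original one is, which is the point the proof relies on.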

With this result we also observe a difference to our separation of TOTAL
and CONS in the classical learning model.

Corollary 2. $\mathrm{suit}(J^*, CONS) \subset \mathrm{suit}(J^*, CONF) \subset \mathrm{suit}(J^*, TOTAL)$.

Proof. This fact follows immediately from Theorem 7 and Theorem 8 and by the
result $\mathrm{suit}(J^*, EX) = \mathrm{suit}(J^*, TOTAL)$ in Theorem 12. □

Obviously, a further change in the hierarchies of inference criteria is
witnessed by the fact $\mathrm{suit}(J^*, CONS) \subset \mathrm{suit}(J^*, CEX)$, which follows by the same
argumentation as in the proof of Corollary 2. We summarize:

Theorem 13. 1. $\mathrm{suit}(J^*, EX_m) \subset \mathrm{suit}(J^*, EX_{m+1}) \subset \mathrm{suit}(J^*, EX)$ for
arbitrary $m \in \mathbb{N}$,

2. $\mathrm{suit}(J^*, CONS) \subset \mathrm{suit}(J^*, CONF) \subset \mathrm{suit}(J^*, TOTAL) = \mathrm{suit}(J^*, CEX) =
\mathrm{suit}(J^*, EX)$.

So in contrast to uniform identification of finite classes with respect to fixed 
hypothesis spaces the separations in Theorem ID cannot be transferred to the un- 
restricted concept of uniform learning. Still it is remarkable, how many inference 
criteria for uniform identification can be separated by collections of finite classes 
of functions - even with very strong results (cf. the remarks below Theorem 6).

Abstract. We consider inductive language learning and machine
discovery from examples with some errors. In the present paper, the error
or incorrectness we consider is the one described uniformly in terms of a
distance over strings. Firstly, we introduce a notion of a recursively
generable distance over strings, and for a language $L$, we define a $k$-neighbor
language $L'$ as a language obtained from $L$ by (i) adding some strings
not in $L$ each of which is at most $k$ distant from some string in $L$ and
by (ii) deleting some strings in $L$ each of which is at most $k$ distant from
some string not in $L$. Then we define a $k$-neighbor system of a base
language class as the collection of $k$-neighbor languages of languages in the
class, and adopt it as a hypothesis space. We give formal definitions of
$k$-neighbor (refutable) inferability, and discuss necessary and sufficient
conditions on such kinds of inference.



1 Introduction 

In the present paper, we consider inductive language learning and machine dis- 
covery from examples with some errors. Inductive inference is a process of hy- 
pothesizing a general rule from examples. As a correct inference criterion for 
inductive inference of formal languages and models of logic programming, we 
have mainly used Gold’s identification in the limit [8] . An inference machine M 
is said to identify a language L in the limit, if the sequence of guesses from M, 
which is successively fed a sequence of examples of L, converges to a correct 
expression of L. In this criterion, a target language, whose examples are fed to 
an inference machine, is assumed to belong to a hypothesis space which is given 
in advance. However, this assumption is not appropriate, if we want an inference 
machine to infer or to discover an unknown rule which explains examples or 
data obtained from scientific experiments. That is, the behavior of an inference 
machine is not specified, in case we feed examples of a target language not in 
the hypothesis space in question. 

In their previous paper, as a computational logic of machine discovery,
Mukouchi and Arikawa [15,17] focused on refutability of the hypothesis space concerned,
and discussed both refutability and inferability from examples. That is, for
every target language, if it is a member of the hypothesis space concerned, then
an inference machine should identify the target language in the limit,
otherwise it should refute the hypothesis space itself in a finite time. They showed
that there are some rich hypothesis spaces that are refutable and inferable from
complete examples (i.e., positive and negative examples, or an informant), but
refutable and inferable classes from only positive examples (i.e., text) are very
small. In relation to refutable inference, Sato discussed general conditions
for a class to be refutably inferable from complete examples. Lange and Watson
[13] and Mukouchi [18] also proposed inference criteria relaxing the requirements
of inference machines, and Jain [10] also deals with the problem for recursively
enumerable languages. On the other hand, Mukouchi as well as Kobayashi and
Yokomori also proposed an inference criterion requiring an inference machine
to infer an admissible approximate language within the hypothesis space
concerned, even when the target language is not in the hypothesis space.

In many real-world applications of machine discovery or machine learning
from examples, we have to deal with incorrect examples. In the present
paper, we consider language learning from observed incorrect examples together
with correct examples, i.e., from imperfect examples. When we are
considering language learning from complete examples, i.e., from positive and negative
examples, some positive examples may be presented to the learner as negative
examples, and vice versa. It is natural to consider that each observed incorrect
example has some connection with a certain correct example on a target language
to be learned. The incorrect examples we consider here are the ones described
uniformly in terms of a distance over strings. Assume that the correct example
is a string $v$ and the observed example is a string $w$. In case we are considering
the so-called Hamming distance and two strings $v$ and $w$ have the same length
but differ in just one symbol, then we estimate the incorrectness as their distance
of one. In case we are considering the edit distance and $w$ can be obtained from
$v$ by deleting just one symbol and inserting one symbol in another place, then
we estimate the incorrectness as their distance of two. Mukouchi and Sato [19]
introduced a notion of a recursively generable distance over strings, and defined
the $k$-neighbor closure of a language $L$ as the collection of strings each of which is
at most $k$ distant from some string in $L$. Then they discussed inferability of a
$k$-neighbor closure of a language in the hypothesis space from positive examples.

There are various approaches to language learning from incorrect examples
(cf. e.g. Jain, Stephan, and Case and Jain). Stephan has
formulated a model of noisy data, in which a correct example crops up infinitely often,
and an incorrect example only finitely often. There is no connection between
incorrect examples considered there and correct examples.

In the present paper, for a language $L$, we define a $k$-neighbor language $L'$
as a language obtained from $L$ by (i) adding some strings not in $L$ each of which
is at most $k$ distant from some string in $L$ and by (ii) deleting some strings in
$L$ each of which is at most $k$ distant from some string not in $L$. Formally, a
language $L'$ is a $k$-neighbor language of $L$, if $L'$ is a subset of the $k$-neighbor
closure of $L$ and ${L'}^c$ is a subset of the $k$-neighbor closure of $L^c$, where $L^c$ is the
complement of $L$. Then we define a $k$-neighbor system of a base language class
as the collection of all $k$-neighbor languages of languages in the class, and adopt
it as a hypothesis space. We consider refutability and inferability of a $k$-neighbor
system from complete examples and present some conditions on language classes
to be refutable and inferable from complete examples.



2 Preliminaries 



2.1 A Language and a Distance 

Let $\Sigma$ be a fixed finite alphabet. Each element of $\Sigma$ is called a constant symbol.
Let $\Sigma^+$ be the set of all nonnull constant strings over $\Sigma$ and let $\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$,
where $\varepsilon$ is the null string. A subset $L$ of $\Sigma^*$ is called a language. The length of
a string $w \in \Sigma^*$ is denoted by $|w|$. For $n \in N$, $\Sigma^n$ denotes the set of all strings
whose length is $n$, and $\Sigma^{\leq n}$ denotes the set of all strings whose length is at most
$n$, that is, $\Sigma^n = \{w \in \Sigma^* \mid |w| = n\}$ and $\Sigma^{\leq n} = \{w \in \Sigma^* \mid |w| \leq n\}$.

A language $L \subseteq \Sigma^*$ is said to be recursive, if there is a computable function
$f : \Sigma^* \to \{0, 1\}$ such that $f(w) = 1$ iff $w \in L$ for $w \in \Sigma^*$.

We consider a distance between two strings defined as follows: 

Definition 1. Let $N = \{0, 1, 2, \ldots\}$ be the set of all natural numbers.

A function $d : \Sigma^* \times \Sigma^* \to N \cup \{\infty\}$ is called a distance over strings, if it
satisfies the following three conditions:

(i) For every $v, w \in \Sigma^*$, $d(v, w) = 0$ iff $v = w$.

(ii) For every $v, w \in \Sigma^*$, $d(v, w) = d(w, v)$.

(iii) For every $u, v, w \in \Sigma^*$, $d(u, v) + d(v, w) \geq d(u, w)$.

A distance $d$ is said to be recursive, if there is an effective procedure that
computes $d(v, w)$ for every $v, w \in \Sigma^*$ with $d(v, w) \neq \infty$.



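Then we define the $k$-neighbor closure of a language as follows:

Definition 2 (Mukouchi and Sato [19]). Let $d : \Sigma^* \times \Sigma^* \to N \cup \{\infty\}$ be a
distance over strings and let $k \in N$.

The $k$-neighbor closure $\overline{w}^{(d,k)}$ of a string $w \in \Sigma^*$ w.r.t. $d$ is the set of all
strings each of which is at most $k$ distant from $w$, that is,
$\overline{w}^{(d,k)} = \{v \in \Sigma^* \mid d(v, w) \leq k\}$.

The $k$-neighbor closure $\overline{L}^{(d,k)}$ of a language $L \subseteq \Sigma^*$ w.r.t. $d$ is the set of
all strings each of which is at most $k$ distant from some string in $L$, that is,
$\overline{L}^{(d,k)} = \{v \in \Sigma^* \mid \exists w \in L \text{ s.t. } d(v, w) \leq k\}$.

By the definition, we see that $\{w\} = \overline{w}^{(d,0)} \subseteq \overline{w}^{(d,1)} \subseteq \overline{w}^{(d,2)} \subseteq \cdots$ and
$L = \overline{L}^{(d,0)} \subseteq \overline{L}^{(d,1)} \subseteq \overline{L}^{(d,2)} \subseteq \cdots$.

The following lemma is obvious:

Lemma 1. Let $d$ be a distance, and let $k \in N$.

For a language $L \subseteq \Sigma^*$ and for a string $w \in \Sigma^*$, $w \in \overline{L}^{(d,k)}$ if and only if
$\overline{w}^{(d,k)} \cap L \neq \emptyset$.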



For a set $S$, we denote by $\sharp S$ the cardinality of $S$. A procedure is said to
generate a finite set $S$, if the procedure enumerates all elements in $S$ and then
stops.

Definition 3 (Mukouchi and Sato [19]). A distance $d$ is said to have finite
thickness, if for every $w \in \Sigma^*$, $\overline{w}^{(d,k)}$ is finite.

A distance $d$ is said to be recursively generable, if $d$ has finite thickness and
there exists an effective procedure that on inputs $k \in N$ and $w \in \Sigma^*$ generates
$\overline{w}^{(d,k)}$.

We note that the notion of a recursively generable finite-set-valued function
was introduced by Lange and Zeugmann.

Example 1. (1) We consider a distance known as the Hamming distance. For a
string $w$ and for a number $i$ with $1 \leq i \leq |w|$, by $w[i]$ let us denote the $i$-th
symbol appearing in $w$. For two strings $v, w \in \Sigma^*$, let

$$d(v, w) = \begin{cases} \sharp\{i \mid 1 \leq i \leq |v|,\ v[i] \neq w[i]\}, & \text{if } |v| = |w|, \\ \infty, & \text{if } |v| \neq |w|. \end{cases}$$

Clearly, this distance $d$ is recursively generable.

(2) Next, we consider a distance known as the edit distance. Roughly
speaking, the edit distance $d$ over two strings $v, w \in \Sigma^*$ is the least number of editing
steps needed to convert $v$ to $w$. Each editing step consists of a rewriting step of
the form $a \to \varepsilon$ (a deletion), $\varepsilon \to b$ (an insertion), or $a \to b$ (a change), where
$a, b \in \Sigma$.

Clearly, this distance $d$ is recursively generable.
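
Both distances of Example 1 can be computed directly. The following is a minimal sketch rather than code from the paper: strings are Python `str` values, $\infty$ is represented by `math.inf`, and the edit distance uses the standard dynamic-programming recurrence over deletions, insertions and changes.

```python
import math
from typing import Union


def hamming(v: str, w: str) -> Union[int, float]:
    """Hamming distance; infinite when the lengths differ."""
    if len(v) != len(w):
        return math.inf
    return sum(1 for a, b in zip(v, w) if a != b)


def edit_distance(v: str, w: str) -> int:
    """Least number of deletions, insertions and changes turning v into w."""
    prev = list(range(len(w) + 1))       # distances from the empty prefix of v
    for i, a in enumerate(v, start=1):
        cur = [i]                        # i deletions reach the empty prefix of w
        for j, b in enumerate(w, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # change a into b (free if equal)
        prev = cur
    return prev[-1]
```

For instance, `"abc"` and `"bca"` differ by one deletion and one insertion in another place, which matches the "distance of two" case described in the introduction.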

Let $d$ be a recursively generable distance, and let $k \in N$. Then, for every
$v, w \in \Sigma^*$, by checking $v \in \overline{w}^{(d,k)}$, whether $d(v, w) \leq k$ or not is recursively
decidable. Therefore $d$ turns out to be a recursive distance. Let $L \subseteq \Sigma^*$ be a recursive
language. Then, for every $w \in \Sigma^*$, by checking $\overline{w}^{(d,k)} \cap L \neq \emptyset$, whether $w \in \overline{L}^{(d,k)}$
or not is recursively decidable. Therefore $\overline{L}^{(d,k)}$ is also a recursive language.

In the present paper, we exclusively deal with a recursively generable distance,
and simply refer to it as a distance without any notice.
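
The decision procedure described above (generate the finite closure of $w$, then test whether it meets $L$) can be sketched for the Hamming distance. The function names and the representation of a recursive language by a Python predicate `member` are assumptions made for this illustration.

```python
from itertools import combinations, product
from typing import Callable, Set


def hamming_ball(w: str, k: int, alphabet: str) -> Set[str]:
    """Generate all strings at Hamming distance at most k from w."""
    ball = set()
    for r in range(min(k, len(w)) + 1):
        for positions in combinations(range(len(w)), r):
            for symbols in product(alphabet, repeat=r):
                chars = list(w)
                for p, s in zip(positions, symbols):
                    chars[p] = s  # rewriting a symbol by itself is harmless
                ball.add("".join(chars))
    return ball


def in_closure(w: str, k: int, alphabet: str,
               member: Callable[[str], bool]) -> bool:
    """Decide membership of w in the k-neighbor closure of L, using Lemma 1."""
    return any(member(v) for v in hamming_ball(w, k, alphabet))


# Hypothetical recursive language: strings containing at least three b's.
at_least_three_b = lambda v: v.count("b") >= 3
```

With this toy language, `in_closure("abba", 1, "ab", at_least_three_b)` holds because changing one `a` into `b` yields a member, whereas `"aaaa"` is too far away for $k = 1$.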

2.2 Inferability from Examples 

We briefly introduce the basic notions necessary for defining our framework of 
neighbor inference. 

Definition 4 (Angluin [2]). A class $\mathcal{L} = \{L_i\}_{i \in N}$ of languages is said to
be an indexed family of recursive languages, if there is a computable function
$f : N \times \Sigma^* \to \{0, 1\}$ such that $f(i, w) = 1$ iff $w \in L_i$.

In the present paper, we adopt an indexed family of recursive languages as
a base hypothesis space.

Definition 5 (Gold [8]). A complete presentation, or an informant, of a
language $L \subseteq \Sigma^*$ is an infinite sequence $(w_0, t_0), (w_1, t_1), \ldots \in \Sigma^* \times \{0, 1\}$ such
that $\{w_i \mid i \in N,\ t_i = 1\} = L$ and $\{w_i \mid i \in N,\ t_i = 0\} = L^c$ $(= \Sigma^* \setminus L)$.

In what follows, $\sigma$ or $\delta$ denotes a complete presentation, and $\sigma[n]$ denotes
$\sigma$'s initial segment of length $n \in N$. For a complete presentation $\sigma$ and for
$n \in N$, we put $\sigma[n]^+ = \{w_i \mid (w_i, 1) \in \sigma[n]\}$ and $\sigma[n]^- = \{w_i \mid (w_i, 0) \in \sigma[n]\}$.

An inductive inference machine (IIM, for short) is an effective procedure, or
a certain type of Turing machine, which requests inputs from time to time and
produces natural numbers from time to time. An inductive inference machine
that can refute hypothesis spaces (RIIM, for short) is an effective procedure which
requests inputs from time to time and either (i) produces natural numbers from
time to time forever or (ii) refutes the class and stops in a finite time after
producing some natural numbers. The outputs produced by the machine are
called guesses.
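
The complete presentations and the sets $\sigma[n]^+$, $\sigma[n]^-$ defined above can be made concrete with a short sketch. It assumes, purely for illustration, that the informant lists $\Sigma^*$ in length-lexicographic order and that the language is given by a Python predicate; an informant in the sense of Definition 5 may use any order and may repeat strings.

```python
from itertools import count, islice, product
from typing import Callable, Iterator, List, Set, Tuple

Example = Tuple[str, int]  # (string, label): label 1 for positive, 0 for negative


def all_strings(alphabet: str) -> Iterator[str]:
    """Enumerate all strings over the alphabet in length-lexicographic order."""
    for n in count(0):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


def informant(member: Callable[[str], bool], alphabet: str) -> Iterator[Example]:
    """One particular complete presentation of the language decided by `member`."""
    for w in all_strings(alphabet):
        yield (w, 1 if member(w) else 0)


def split_segment(segment: List[Example]) -> Tuple[Set[str], Set[str]]:
    """Return the positive and negative strings of an initial segment."""
    positive = {w for w, t in segment if t == 1}
    negative = {w for w, t in segment if t == 0}
    return positive, negative


# Hypothetical target language: strings with an even number of b's.
even_b = lambda w: w.count("b") % 2 == 0
segment = list(islice(informant(even_b, "ab"), 4))
positive, negative = split_segment(segment)
```

Here `segment` plays the role of $\sigma[4]$, and `positive` and `negative` are the corresponding sets $\sigma[4]^+$ and $\sigma[4]^-$.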

For an IIM $M$ or an RIIM $M$, for a complete presentation $\sigma$ and for $n \in N$,
by $M(\sigma[n])$ we denote the last guess or the refutation sign produced by $M$ which
is successively presented examples in $\sigma[n]$ on its input requests.

An IIM $M$ or an RIIM $M$ is said to converge to a number $i$ for a complete
presentation $\sigma$, if there is an $n \in N$ such that for every $m \geq n$, $M(\sigma[m]) = i$.
An RIIM $M$ is said to refute a class $\mathcal{L}$ from a complete presentation $\sigma$, if there
is an $n \in N$ such that $M(\sigma[n])$ is the refutation sign. In this case we also say
that $M$ refutes the class $\mathcal{L}$ from $\sigma[n]$.

Then we define the ordinary inferability of a class of languages as follows: 

Definition 6 (Gold [8]). Let $\mathcal{L} = \{L_i\}_{i \in N}$ be a class of languages.

An IIM $M$ is said to infer a language $L_i \in \mathcal{L}$ in the limit from complete
examples, if for every complete presentation $\sigma$ of $L_i$, $M$ converges to an index
$j$ for $\sigma$ such that $L_j = L_i$.

An IIM $M$ is said to infer a class $\mathcal{L}$ in the limit from complete examples, if
for every $L_i \in \mathcal{L}$, $M$ infers $L_i$ in the limit from complete examples.

A class $\mathcal{L}$ is said to be inferable in the limit from complete examples, if there
is an IIM which infers $\mathcal{L}$ in the limit from complete examples.

In the definition above, the behavior of an inference machine is not specified, 
when we feed a complete presentation of a language which is not in the class 
concerned. 

Definition 7 (Mukouchi and Arikawa [15,17]). An RIIM $M$ is said to
refutably infer a class $\mathcal{L}$ from complete examples, if it satisfies the following
condition: For every $L \subseteq \Sigma^*$, (i) if $L \in \mathcal{L}$, then $M$ infers $L$ in the limit from
complete examples, (ii) otherwise $M$ refutes $\mathcal{L}$ from every complete presentation
of $L$.

A class $\mathcal{L}$ is said to be refutably inferable from complete examples, if there
is an RIIM which refutably infers $\mathcal{L}$ from complete examples.

Now, we introduce our successful learning criterion we consider in the present 
paper. 

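Definition 8. Let $d$ be a distance, and let $k \in N$.

A language $L' \subseteq \Sigma^*$ is said to be a $k$-neighbor language of a language $L \subseteq \Sigma^*$
w.r.t. $d$, if $L' \subseteq \overline{L}^{(d,k)}$ and ${L'}^c \subseteq \overline{L^c}^{(d,k)}$.

The set of all $k$-neighbor languages of $L$ w.r.t. $d$ is denoted by $[L]^{(d,k)}$. The
$k$-neighbor system $[\mathcal{L}]^{(d,k)}$ of a class $\mathcal{L} = \{L_i\}_{i \in N}$ w.r.t. $d$ is the collection of all
$k$-neighbor languages of languages in $\mathcal{L}$ w.r.t. $d$, that is, $[\mathcal{L}]^{(d,k)} = \bigcup_{i \in N} [L_i]^{(d,k)}$.

In the definition above, we note that $L' \subseteq \overline{L}^{(d,k)}$ and ${L'}^c \subseteq \overline{L^c}^{(d,k)}$ if and
only if $(\overline{L^c}^{(d,k)})^c \subseteq L' \subseteq \overline{L}^{(d,k)}$.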

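Example 2. (1) Let $\Sigma = \{a, b\}$, let $d$ be the Hamming distance, and let $k = 2$. For
a string $w \in \Sigma^*$, let us denote by $o(w, b)$ the number of $b$'s appearing in $w$. We
consider the language $L = \{w \in \Sigma^* \mid 3 \leq o(w, b) \leq 7\}$. Then
$\overline{L}^{(d,k)} = \{w \in \Sigma^* \mid 1 \leq o(w, b) \leq 9\}$. On the other hand,
$L^c = \{w \in \Sigma^* \mid o(w, b) < 3 \text{ or } o(w, b) > 7\}$, and thus
$\overline{L^c}^{(d,k)} = \{w \in \Sigma^* \mid o(w, b) \neq 5\}$ and $(\overline{L^c}^{(d,k)})^c = \{w \in \Sigma^* \mid o(w, b) = 5\}$.
Therefore $L' \in [L]^{(d,k)}$ if and only if
$\{w \in \Sigma^* \mid o(w, b) = 5\} \subseteq L' \subseteq \{w \in \Sigma^* \mid 1 \leq o(w, b) \leq 9\}$.

(2) Let $\Sigma = \{a\}$, let $d$ be the edit distance, and let $k = 1$. We consider the
language $L = \{a, aaa, \ldots, a^{2n+1}, \ldots\}$. As easily seen,
$\overline{L}^{(d,k)} = \{\varepsilon, a, aa, aaa, \ldots, a^n, \ldots\} = \Sigma^*$. On the other hand,
$L^c = \{\varepsilon, aa, aaaa, \ldots, a^{2n}, \ldots\}$, and thus $\overline{L^c}^{(d,k)} = \Sigma^*$ and
$(\overline{L^c}^{(d,k)})^c = \emptyset$. Therefore $[L]^{(d,k)}$ consists of all languages over $\Sigma$.

(3) Let $d$ be an arbitrary distance, and let $k \in N$. We consider the language
$L = \Sigma^*$. As easily seen, $\overline{L}^{(d,k)} = \Sigma^*$. On the other hand, $L^c = \emptyset$, and thus
$\overline{L^c}^{(d,k)} = \emptyset$ and $(\overline{L^c}^{(d,k)})^c = \Sigma^*$. Therefore $[L]^{(d,k)} = \{L\}$.

In a similar way, we see that $[L']^{(d,k)} = \{L'\}$, where $L' = \emptyset$.

We note that in case $L$ is a recursive language, $\overline{L}^{(d,k)}$ and $(\overline{L^c}^{(d,k)})^c$ are also
recursive languages, while $L' \in [L]^{(d,k)}$ is not a recursive language in general.
Furthermore neither the class $[L]^{(d,k)}$ nor $[\mathcal{L}]^{(d,k)}$ is indexable in general.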



Definition 9. Let $\mathcal{L} = \{L_i\}_{i \in N}$ be a class of languages, let $d$ be a distance, and
let $k \in N$.

For a language $L \subseteq \Sigma^*$, a pair $(i, j) \in N \times N$ is said to be a weak $k$-neighbor
answer for $L$, if $j \leq k$ and $L \in [L_i]^{(d,j)}$.

For a language $L \subseteq \Sigma^*$, a pair $(i, j) \in N \times N$ is said to be a $k$-neighbor
answer for $L$, if (i) $(i, j)$ is a weak $k$-neighbor answer for $L$ and (ii) for every
pair $(i', j')$ with $j' < j$, $L \notin [L_{i'}]^{(d,j')}$.

An IIM $M$ is said to $k$-neighborly (resp., weak $k$-neighborly) infer a class
$\mathcal{L}$ w.r.t. $d$ from complete examples, if for every $L \in [\mathcal{L}]^{(d,k)}$ and every complete
presentation $\sigma$ of $L$, $M$ converges to a number $\langle i, j \rangle$ for $\sigma$ such that $(i, j)$ is
a $k$-neighbor (resp., weak $k$-neighbor) answer for $L$, where $\langle \cdot, \cdot \rangle$ represents
Cantor's pairing function.

A class $\mathcal{L}$ is said to be $k$-neighborly (resp., weak $k$-neighborly) inferable
w.r.t. $d$ from complete examples, if there is an IIM which $k$-neighborly (resp.,
weak $k$-neighborly) infers $\mathcal{L}$ w.r.t. $d$ from complete examples.

A class $\mathcal{L}$ is said to be neighborly (resp., weak neighborly) inferable w.r.t.
$d$ from complete examples, if for every $k \in N$, $\mathcal{L}$ is $k$-neighborly (resp., weak
$k$-neighborly) inferable w.r.t. $d$ from complete examples.

We also omit the phrase 'w.r.t. $d$', if it holds for every distance $d$.
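
Since the machine outputs a single natural number $\langle i, j \rangle$ that encodes both an index and a distance bound, it may help to see one concrete pairing function. The sketch below uses a common form of Cantor's pairing; which variant the paper intends is not specified, so the exact formula is an assumption of this example.

```python
from math import isqrt
from typing import Tuple


def pair(i: int, j: int) -> int:
    """Cantor pairing: a bijection from pairs of naturals to naturals."""
    return (i + j) * (i + j + 1) // 2 + j


def unpair(z: int) -> Tuple[int, int]:
    """Inverse of `pair`."""
    s = (isqrt(8 * z + 1) - 1) // 2  # s = i + j, the index of the diagonal
    j = z - s * (s + 1) // 2
    return s - j, j
```

A guess $\langle i, j \rangle$ can thus be decoded by the observer into the language index $i$ and the claimed neighbor bound $j$.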



Furthermore, we take the refutability of the class into consideration as fol- 
lows: 







Definition 10. Let d be a distance, and let $k \in N$.

An RIIM M is said to k-neighbor-refutably (resp., weak k-neighbor-refutably) infer a class $\mathcal{L}$ w.r.t. d from complete examples, if it satisfies the following condition: For every $L \subseteq \Sigma^*$, (i) if $L \in [\mathcal{L}]^{(d,k)}$, then M k-neighborly (resp., weak k-neighborly) infers L from complete examples, (ii) otherwise M refutes $\mathcal{L}$ from every complete presentation of L.

A class $\mathcal{L}$ is said to be k-neighbor-refutably (resp., weak k-neighbor-refutably) inferable w.r.t. d from complete examples, if there is an RIIM which k-neighbor-refutably (resp., weak k-neighbor-refutably) infers $\mathcal{L}$ w.r.t. d from complete examples.

A class $\mathcal{L}$ is said to be neighbor-refutably (resp., weak neighbor-refutably) inferable w.r.t. d from complete examples, if for every $k \in N$, $\mathcal{L}$ is k-neighbor-refutably (resp., weak k-neighbor-refutably) inferable w.r.t. d from complete examples.

We also omit the phrase ‘w.r.t. d’, if it holds for every distance d.

The rest of this section summarizes some known results related to this study.

Since we are considering an indexed family of recursive languages, the following theorem is valid:

Theorem 1 (Gold [8]). Every class $\mathcal{L}$ is inferable in the limit from complete examples.



Definition 11 (Mukouchi and Arikawa [15,17]). A pair $(T,F)$ of subsets of $\Sigma^*$ is said to be consistent with a language L, if $T \subseteq L$ and $F \subseteq L^c$.

The econs function e for a class $\mathcal{L}$ is the function such that for two finite sets $T, F \subseteq \Sigma^*$,

$$e(T,F) = \begin{cases} 1, & \text{if there exists an } L \in \mathcal{L} \text{ such that } (T,F) \text{ is consistent with } L, \\ 0, & \text{otherwise.} \end{cases}$$
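
To make Definition 11 concrete, here is a minimal sketch of consistency and of the econs function for a finite family of languages given by membership tests. The restriction to a finite family is an assumption of the sketch (for an infinite indexed family the econs function need not be computable), and the toy family and all names are hypothetical.

```python
from typing import Callable, Sequence

Language = Callable[[str], bool]  # membership test of a recursive language


def consistent(T: set[str], F: set[str], L: Language) -> bool:
    """(T, F) is consistent with L iff T lies inside L and F lies inside its complement."""
    return all(L(w) for w in T) and not any(L(w) for w in F)


def econs(T: set[str], F: set[str], family: Sequence[Language]) -> int:
    """econs value for a FINITE family (an assumption of this sketch)."""
    return 1 if any(consistent(T, F, L) for L in family) else 0


# Hypothetical toy family: the empty language, the even-length strings, and {"a"}.
family = [lambda w: False, lambda w: len(w) % 2 == 0, lambda w: w == "a"]
```

For instance, `econs({"a", "aa"}, set(), family)` is 0 in this toy family, because no member contains both `"a"` and `"aa"`.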



Definition 12 (Mukouchi [14]). A pair $(T,F)$ of finite subsets of $\Sigma^*$ is said to be a pair of definite finite tell-tale sets of a language L within a class $\mathcal{L}$, if (i) $(T,F)$ is consistent with L and (ii) $(T,F)$ is inconsistent with every $L' \in \mathcal{L}$ with $L' \ne L$.

Theorem 2 (Mukouchi and Arikawa [15,17]). A class $\mathcal{L}$ is refutably inferable from complete examples, if and only if it satisfies the following two conditions (C1) and (C2):

(C1) The econs function for $\mathcal{L}$ is computable.

(C2) For every $L \notin \mathcal{L}$, there is a pair of definite finite tell-tale sets of L within $\mathcal{L}$.

The condition (C2) above means that for every $L \notin \mathcal{L}$, there is a pair $(T,F)$ of finite subsets of $\Sigma^*$ such that (i) $(T,F)$ is consistent with L and that (ii) $(T,F)$ is inconsistent with every language in $\mathcal{L}$.
3 Inferability

3.1 Characterizations

The following theorem can be shown by a simple enumerative method:

Theorem 3. For every $k \in N$ and every distance d, every class is weak k-neighborly inferable w.r.t. d from complete examples, and thus every class is weak neighborly inferable from complete examples.

Proof. Let d be a distance, and let $k \in N$. We consider the algorithm in Figure 1.



Procedure IIM M
begin
    let $T_0 := \emptyset$ and $F_0 := \emptyset$;
    let n := 0 and i := 0;
    repeat
        let n := n + 1;
        read the next example (w, v);
        if v = 1 then let $T_n := T_{n-1} \cup \{w\}$ and $F_n := F_{n-1}$
        else let $T_n := T_{n-1}$ and $F_n := F_{n-1} \cup \{w\}$;
        while $T_n \not\subseteq L_i^{(d,k)}$ or $F_n \not\subseteq (L_i^c)^{(d,k)}$ do i := i + 1;
        output $\langle i, k \rangle$;
    forever;
end.



Fig. 1. An IIM which weak k-neighborly infers a class w.r.t. d from complete examples



It is easy to see that the algorithm weak k-neighborly infers every class w.r.t. d from complete examples. □
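
The enumerative procedure of Fig. 1 can be replayed on a small finite example. The sketch below assumes a finite universe of strings, a finite family of finite languages, and a hypothetical distance (the difference of lengths); none of these choices come from the paper, and the guesses are kept as pairs $(i,k)$ rather than coded numbers.

```python
def neighborhood(L, universe, d, k):
    """L^(d,k) inside a finite universe: strings within distance k of some member of L."""
    return {w for w in universe if any(d(v, w) <= k for v in L)}


def weak_neighbor_learner(examples, family, universe, d, k):
    """Replay the loop of Fig. 1 and return the guess (i, k) made after each example."""
    T, F, i, guesses = set(), set(), 0, []
    for w, label in examples:
        (T if label == 1 else F).add(w)
        # Advance i until some k-neighbor language of L_i can be consistent with (T, F).
        while not (T <= neighborhood(family[i], universe, d, k)
                   and F <= neighborhood(universe - family[i], universe, d, k)):
            i += 1  # runs off the list if no L_i fits; the paper's family is infinite
        guesses.append((i, k))
    return guesses


# Hypothetical toy data: unary strings, L_i = {a^i}, d = difference of lengths.
universe = {"a" * n for n in range(6)}
family = [{"a" * n} for n in range(6)]
length_gap = lambda v, w: abs(len(v) - len(w))
target = {"aaa", "aaaa"}
examples = [(w, int(w in target)) for w in sorted(universe, key=len)]
guesses = weak_neighbor_learner(examples, family, universe, length_gap, 1)
```

On this presentation the guesses stabilize on $(3,1)$: the target $\{aaa, aaaa\}$ is a 1-neighbor language of $L_3 = \{aaa\}$, and no smaller index passes the test once both positive examples have been read.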

On k-neighbor inferability, the following theorem is valid:

Theorem 4. Let d be a distance, and let $k \in N$.

A class $\mathcal{L}$ is k-neighborly inferable w.r.t. d from complete examples, if and only if $\mathcal{L}$ satisfies the following condition (C3):

(C3) There is a computable function f which satisfies the following condition: For every $L \in [\mathcal{L}]^{(d,k)}$ and every complete presentation $\sigma$ of L, there is an $n \in N$ such that for every $m \ge n$, $f(\sigma[m]) = k'$, where $k' \in N$ is the least number such that $L \in [\mathcal{L}]^{(d,k')}$.

Proof. The ‘only if’ part is obvious.

The ‘if’ part. We assume that there is a computable function f which satisfies the condition above. Then we consider the algorithm in Figure 2.
Procedure IIM M
begin
    let $T_0 := \emptyset$ and $F_0 := \emptyset$;
    let $\delta$ be the empty sequence;
    let n := 0;
    repeat
        let n := n + 1;
        read the next example (w, v);
        let $\delta := \delta \cdot (w, v)$;
        if v = 1 then let $T_n := T_{n-1} \cup \{w\}$ and $F_n := F_{n-1}$
        else let $T_n := T_{n-1}$ and $F_n := F_{n-1} \cup \{w\}$;
        let $k' := f(\delta)$;
        search for the least index $i \le n$ such that $T_n \subseteq L_i^{(d,k')}$ and $F_n \subseteq (L_i^c)^{(d,k')}$;
        if such an index i is found then output $\langle i, k' \rangle$
        else output $\langle n, k \rangle$;
    forever;
end.



Fig. 2. An IIM which k-neighborly infers a class w.r.t. d from complete examples



Let $L \in [\mathcal{L}]^{(d,k)}$, and we assume that a complete presentation $\sigma$ of L is fed to the procedure. Let $k_0$ be the least number $k'$ such that $L \in [\mathcal{L}]^{(d,k')}$, and then let $(i_0,k_0)$ be the k-neighbor answer for L such that for every $i < i_0$, $(i,k_0)$ is not a k-neighbor answer for L. Then it is easy to see that the algorithm converges to $\langle i_0,k_0 \rangle$.

Therefore the algorithm k-neighborly infers $\mathcal{L}$ w.r.t. d from complete examples. □

Corollary 1. Let d be a distance, and let $k \in N$.

If a class $\mathcal{L}$ is (k + 1)-neighborly inferable w.r.t. d from complete examples, then $\mathcal{L}$ is also k-neighborly inferable w.r.t. d from complete examples.

In a similar way to Mukouchi and Arikawa [15,17], we can show the following theorem:

Theorem 5. Let d be a distance, and let $k \in N$.

A class $\mathcal{L}$ is weak k-neighbor-refutably inferable w.r.t. d from complete examples, if and only if $\mathcal{L}$ satisfies the following two conditions (C4) and (C5):

(C4) The econs function for the class $[\mathcal{L}]^{(d,k)}$ is computable.

(C5) For every $L \notin [\mathcal{L}]^{(d,k)}$, there is a pair of definite finite tell-tale sets of L within the class $[\mathcal{L}]^{(d,k)}$.

Theorem 6. Let d be a distance, and let $k \in N$.

A class $\mathcal{L}$ is k-neighbor-refutably inferable w.r.t. d from complete examples, if and only if $\mathcal{L}$ satisfies the conditions (C3), (C4) and (C5).
Theorem 7. Let d be a distance, and let $k \in N$.

If a class $\mathcal{L}$ satisfies the following two conditions (C4’) and (C5’), then $\mathcal{L}$ also satisfies the condition (C3), and thus $\mathcal{L}$ is k-neighborly inferable w.r.t. d from complete examples:

(C4’) For every $k' < k$, the econs function for the class $[\mathcal{L}]^{(d,k')}$ is computable.

(C5’) For every $k' < k$ and every $L \notin [\mathcal{L}]^{(d,k')}$, there is a pair of definite finite tell-tale sets of L within the class $[\mathcal{L}]^{(d,k')}$.

Proof. Assume that a class $\mathcal{L}$ satisfies the conditions (C4’) and (C5’).

For $k' < k$, let $e_{k'}$ be the econs function for the class $[\mathcal{L}]^{(d,k')}$. Then we construct an algorithm for computing $f(\sigma[n])$ as follows:

(i) Search for the least $k' < k$ such that $e_{k'}(\sigma[n]^+, \sigma[n]^-) = 1$.

(ii) If such an index $k'$ is found then output $k'$, otherwise output k.

Then it is easy to see that the algorithm witnesses the condition (C3). □
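
Steps (i) and (ii) of the proof can be sketched for a finite toy setting. The code below assumes a finite universe, a finite family of finite languages, and a hypothetical distance; in that setting the econs function of the $k'$-neighbor class is checked directly through neighborhoods. All names and data are illustrative, not the paper's.

```python
def neighborhood(L, universe, d, k):
    """L^(d,k) inside a finite universe: strings within distance k of some member of L."""
    return {w for w in universe if any(d(v, w) <= k for v in L)}


def econs_neighbor(T, F, family, universe, d, k):
    """econs value for the k-neighbor class of a finite family over a finite universe."""
    if T & F:
        return 0
    return int(any(T <= neighborhood(L, universe, d, k)
                   and F <= neighborhood(universe - L, universe, d, k)
                   for L in family))


def least_radius(T, F, family, universe, d, k):
    """Steps (i) and (ii): the least k' < k whose econs value is 1, otherwise k."""
    for k_prime in range(k):
        if econs_neighbor(T, F, family, universe, d, k_prime) == 1:
            return k_prime
    return k


# Hypothetical toy data: unary strings, L_i = {a^i}, d = difference of lengths.
universe = {"a" * n for n in range(6)}
family = [{"a" * n} for n in range(6)]
length_gap = lambda v, w: abs(len(v) - len(w))
radius = least_radius({"aaa", "aaaa"}, {"", "a"}, family, universe, length_gap, 2)
```

Here `radius` is 1: no single $L_i$ contains both positive examples, but both lie within distance 1 of $L_3$.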

By Theorems 6 and 7, the following corollary is valid:

Corollary 2. Let d be a distance, and let $k \in N$.

If a class $\mathcal{L}$ satisfies the conditions (C4), (C5), (C4’) and (C5’), then $\mathcal{L}$ is k-neighbor-refutably inferable w.r.t. d from complete examples.

Furthermore, by Theorem 5 and Corollary 2, the following corollary is valid:

Corollary 3. Let d be a distance.

A class $\mathcal{L}$ is neighbor-refutably inferable w.r.t. d from complete examples, if and only if $\mathcal{L}$ is weak neighbor-refutably inferable w.r.t. d from complete examples.

On the other hand, by Theorems 2 and 7, the following corollary is valid:

Corollary 4. If a class $\mathcal{L}$ is refutably inferable from complete examples, then $\mathcal{L}$ is also 1-neighborly inferable from complete examples.



3.2 Some Other Conditions 

In the previous section, we showed that every class is weak neighborly inferable 
from complete examples. On neighbor inferability, there is a class that is not 
neighborly inferable from complete examples. 

Theorem 8. Let $k \ge 1$.

There is a class $\mathcal{L}$ and a distance d such that $\mathcal{L}$ is not k-neighborly inferable w.r.t. d from complete examples.



Example 3. We consider the class $\mathcal{FL}$ of all finite languages over $\Sigma$. Then, as easily seen, for every distance d and every $k \in N$, the class $[\mathcal{FL}]^{(d,k)}$ also consists of all finite languages.

(1) For every distance d and every $k \in N$, the class $\mathcal{FL}$ is k-neighborly inferable w.r.t. d from complete examples, that is, the class $\mathcal{FL}$ is neighborly inferable from complete examples.

(2) For every distance d and every $k \in N$, the class $\mathcal{FL}$ is not (weak) k-neighbor-refutably inferable w.r.t. d from complete examples.

This is because there is no pair of definite finite tell-tale sets of an infinite language within $\mathcal{FL}$.

For a set $T \subseteq \Sigma^*$, let us put $X^{(d,k)}(T) = \{T' \subseteq T^{(d,k)} \mid T \subseteq T'^{(d,k)}\}$.

Lemma 2. Let d be a distance, let $k \in N$, let $\mathcal{L}$ be a class of languages, and let $T, F \subseteq \Sigma^*$ be two sets such that $T \cap F = \emptyset$.

There exists an $L \in [\mathcal{L}]^{(d,k)}$ such that $(T,F)$ is consistent with L, if and only if there exist an $L' \in \mathcal{L}$, a $T' \in X^{(d,k)}(T)$ and an $F' \in X^{(d,k)}(F)$ such that $(T',F')$ is consistent with $L'$.



Proposition 1. Let d be a distance.

If the econs function for a class $\mathcal{L}$ is computable, then for every $k \in N$, the econs function for the class $[\mathcal{L}]^{(d,k)}$ is also computable.

Proof. Assume that e is the computable econs function for a class $\mathcal{L}$.

Let $k \in N$, and let T, F be two finite subsets of $\Sigma^*$. Then we see that $X^{(d,k)}(T)$ is a finite class of finite subsets of $\Sigma^*$, and so is $X^{(d,k)}(F)$. Thus we see by Lemma 2 that we can recursively decide whether or not there is an $L \in [\mathcal{L}]^{(d,k)}$ such that $(T,F)$ is consistent with L by checking $T \cap F = \emptyset$ and $e(T',F') = 1$ for some $(T',F') \in X^{(d,k)}(T) \times X^{(d,k)}(F)$.

Therefore the econs function for the class $[\mathcal{L}]^{(d,k)}$ is computable. □

On the other hand, the converse is not valid in general.

Proposition 2. There is a distance d and a class $\mathcal{L}$ such that for every $k \ge 1$, the econs function for the class $[\mathcal{L}]^{(d,k)}$ is computable and that the econs function for $\mathcal{L}$ is not computable.

Proof. Let $\varphi_0, \varphi_1, \varphi_2, \ldots$ be all partial recursive functions of one variable with acceptable numbering (cf. Rogers [20]), and then let $\Phi_0, \Phi_1, \Phi_2, \ldots$ be computational complexity measures (cf. Blum [6]), that is, the following conditions hold:

(i) For every $i, x \in N$, $\varphi_i(x)$ is defined, if and only if $\Phi_i(x)$ is defined.

(ii) For every $i, x, y \in N$, whether $\Phi_i(x) \le y$ or not is recursively decidable.

Let $\Sigma = \{a, b\}$ and let us put

$$L_{\langle i,j \rangle} = \begin{cases} \{a^{i+1}, b^{i+1}\}, & \text{if } \Phi_i(i) \le j, \\ \{a^{i+1}\}, & \text{otherwise,} \end{cases}$$

where $\langle \cdot,\cdot \rangle$ represents Cantor's pairing function. Then we consider the class $\mathcal{L} = \{L_i\}_{i \in N}$.
Suppose that the econs function for $\mathcal{L}$ is computable. Then, for every $i \in N$, by testing whether there is an $L \in \mathcal{L}$ such that $(\{b^{i+1}\}, \emptyset)$ is consistent with L, we can recursively decide whether or not $\varphi_i(i)$ is defined. This contradicts the undecidability of the halting problem.

Hence the econs function for $\mathcal{L}$ is not computable.

Let $k \ge 1$ and let d be the distance such that

$$d(v,w) = \begin{cases} 0, & \text{if } v = w, \\ 1, & \text{if } v \ne w \text{ and } |v| = |w|, \\ \infty, & \text{otherwise.} \end{cases}$$



Then, for every i,j G N, = 11®+^ and £^. = E*, and thus 

= {£ I £ C Hence the econs function e for the class is 

such that for two finite sets T, F C E* , 



e{T,F) 



1, ifdiG A^s.t. rcr®+i and£C (r®+i)®=, 

0, otherwise, 



and thus it is computable. 



□ 
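
The distance d used in the proof of Proposition 2 is simple enough to write down directly. The sketch below is an illustration over the alphabet $\{a, b\}$; the helper `string_neighborhood` and its name are assumptions added here to show that, for $k \ge 1$, the k-neighborhood of a string is the set of all strings of the same length.

```python
import math
from itertools import product


def d(v: str, w: str) -> float:
    """0 for equal strings, 1 for distinct strings of equal length, infinity otherwise."""
    if v == w:
        return 0
    if len(v) == len(w):
        return 1
    return math.inf


def string_neighborhood(w: str, k: int, alphabet: str = "ab") -> set[str]:
    """w^(d,k): only strings of the same length as w can be at finite distance."""
    same_length = ("".join(p) for p in product(alphabet, repeat=len(w)))
    return {u for u in same_length if d(u, w) <= k}
```

With this distance, `string_neighborhood("aa", 1)` contains every string of length 2 over $\{a, b\}$, while `string_neighborhood("aa", 0)` contains only `"aa"`.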



By Theorem 2 and Proposition 1, the following corollary is valid:

Corollary 5. If a class $\mathcal{L}$ is refutably inferable from complete examples, then the econs function for $\mathcal{L}$ is computable, and thus for every distance d and every $k \in N$, the econs function for the class $[\mathcal{L}]^{(d,k)}$ is computable.

The following lemma is basic:

Lemma 3. Let d be a distance, let $k, n \in N$, let $L_1, \ldots, L_n \subseteq \Sigma^*$, and let $L \subseteq \Sigma^*$ be a language such that $L \notin [L_1]^{(d,k)} \cup \cdots \cup [L_n]^{(d,k)}$.

There is a pair $(T,F)$ of finite subsets of $\Sigma^*$ such that (i) $(T,F)$ is consistent with L and that (ii) $(T,F)$ is inconsistent with every $L' \in [L_1]^{(d,k)} \cup \cdots \cup [L_n]^{(d,k)}$.


Definition 13 (Mukouchi and Arikawa [15,17]). Let $\mathcal{L} = \{L_i\}_{i \in N}$ be a class of languages, and let S be a subclass of $\mathcal{L}$.

A set $I \subseteq N$ of indices is said to be a cover-index set of S, if the collection of all languages each of which has an index in I is equal to S, that is, $S = \{L_i \in \mathcal{L} \mid i \in I\}$.

Proposition 3 (Mukouchi and Arikawa [15,17]). If a class $\mathcal{L}$ satisfies the following two conditions (C6) and (C7), then $\mathcal{L}$ is refutably inferable from complete examples:

(C6) There is an effective procedure which on input $w \in \Sigma^*$ generates a finite cover-index set of the subclass $\{L_i \in \mathcal{L} \mid w \in L_i\}$ of $\mathcal{L}$.

(C7) The class $\mathcal{L}$ contains the empty language as its member.
Theorem 9. If a class $\mathcal{L}$ satisfies the conditions (C6) and (C7), then $\mathcal{L}$ satisfies the following two conditions (C4”) and (C5”):

(C4”) For every distance d and every $k \in N$, the econs function for the class $[\mathcal{L}]^{(d,k)}$ is computable.

(C5”) For every distance d, every $k \in N$ and every $L \notin [\mathcal{L}]^{(d,k)}$, there is a pair of definite finite tell-tale sets of L within the class $[\mathcal{L}]^{(d,k)}$.

Proof. Assume that a class $\mathcal{L}$ satisfies the conditions (C6) and (C7). Then, by Proposition 3 and Corollary 5, we see $\mathcal{L}$ satisfies the condition (C4”) above.

Let d be a distance, let $k \in N$, and let $L \notin [\mathcal{L}]^{(d,k)}$. By the condition (C7), $\mathcal{L}$ contains the empty language as its member, and so does $[\mathcal{L}]^{(d,k)}$. Thus L is nonempty, and let $w \in L$.

Let us put $\mathcal{T} = \{L' \in \mathcal{L} \mid w^{(d,k)} \cap L' \ne \emptyset\}$. Then, for every $L'' \in [\mathcal{L}]^{(d,k)}$, if $w \in L''$, then $L'' \in [\mathcal{T}]^{(d,k)}$. In fact, let $L'' \in [\mathcal{L}]^{(d,k)}$ be a language such that $w \in L''$, and let $L' \in \mathcal{L}$ be a language such that $L'' \in [L']^{(d,k)}$. Since $w \in L'' \subseteq L'^{(d,k)}$, we see by Lemma 1 that $w^{(d,k)} \cap L' \ne \emptyset$, and thus $L' \in \mathcal{T}$. Hence $L'' \in [\mathcal{T}]^{(d,k)}$ holds.

Since the distance d has finite thickness and $\mathcal{L}$ satisfies the condition (C6), we see that $\mathcal{T}$ is a finite subclass of $\mathcal{L}$. Appealing to Lemma 3, we see that there is a pair $(T,F)$ of finite subsets of $\Sigma^*$ such that (i) $(T,F)$ is consistent with L and that (ii) $(T,F)$ is inconsistent with every $L'' \in [\mathcal{T}]^{(d,k)}$.

Finally, we put $T' = T \cup \{w\}$ and $F' = F$. Then, as easily seen, $(T',F')$ is a pair of definite finite tell-tale sets of L within the class $[\mathcal{L}]^{(d,k)}$. □

By Theorems El Cl and O and Corollary O we have the following corollary: 

Corollary 6. Assume that a class $\mathcal{L}$ satisfies the conditions (C6) and (C7).

(1) The class $\mathcal{L}$ is weak neighbor-refutably inferable from complete examples.

(2) The class $\mathcal{L}$ is neighbor-refutably inferable from complete examples.

(3) The class $\mathcal{L}$ is neighborly inferable from complete examples.



Example 4. Here, we consider the class $\mathcal{PAT}$ of pattern languages.

We briefly recall a pattern and a pattern language. For more details, please 
refer to Angluin m- 

Fix a finite alphabet $\Sigma$. A pattern $\pi$ is a nonnull finite string of constant and variable symbols. The pattern language $L(\pi)$ generated by a pattern $\pi$ is the set of all strings obtained by substituting nonnull strings of constant symbols for the variables in $\pi$. Since two patterns that are identical except for renaming of variables generate the same pattern language, we do not distinguish one from the other. We can enumerate all patterns recursively and whether $w \in L(\pi)$ or not is recursively decidable. Therefore we can consider the class of pattern languages as an indexed family of recursive languages, where the pattern itself is considered to be an index.
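
The decidability of $w \in L(\pi)$ can be illustrated with a short membership test. The sketch below is only one possible way to decide membership: it translates a pattern into a regular expression with backreferences and lets Python's backtracking matcher search for a nonnull substitution. The encoding of patterns (strings for constants, integers for variables) and the function names are assumptions made for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern into a regex; constants are strings, variables are ints."""
    group_of, parts = {}, []
    for sym in pattern:
        if isinstance(sym, int):
            if sym in group_of:
                # Repeated variable: must repeat the string bound at its first occurrence.
                parts.append(f"(?:\\{group_of[sym]})")
            else:
                group_of[sym] = len(group_of) + 1
                parts.append("(.+)")  # nonnull substitution
        else:
            parts.append(re.escape(sym))
    return re.compile("".join(parts), re.DOTALL)


def in_pattern_language(w, pattern):
    """Decide whether w belongs to L(pattern) by exhaustive backtracking search."""
    return pattern_to_regex(pattern).fullmatch(w) is not None


# Hypothetical pattern a x b x, with the variable x encoded as 0.
pi = ["a", 0, "b", 0]
```

For example, `in_pattern_language("acbc", pi)` holds with the substitution x := c, while `"ab"` is rejected because variables may not be replaced by the null string.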

As easily seen, the empty language $L = \emptyset$ does not belong to $\mathcal{PAT}$ and there is no pair of definite finite tell-tale sets of L within $\mathcal{PAT}$. Thus the class $\mathcal{PAT}$ is not refutably inferable from complete examples (cf. Mukouchi and Arikawa [15,17]).
However the class $\mathcal{PAT}$ satisfies the condition (C6). In fact, fix an arbitrary constant string w. As easily seen, if $w \in L(\pi)$, then $\pi$ is not longer than w. There is an effective procedure that on input a fixed length generates the set of all patterns whose length is at most that length, and whether $w \in L(\pi)$ or not is recursively decidable. Therefore there is an effective procedure that on input $w \in \Sigma^+$ generates the set $\{\pi \mid w \in L(\pi)\}$.

Let $\mathcal{PAT}'$ be the class of all pattern languages and the empty language. Then the class $\mathcal{PAT}'$ satisfies the conditions (C6) and (C7), and thus by Corollary 6 we see that the following propositions are valid:

(1) The class $\mathcal{PAT}'$ is weak neighbor-refutably inferable from complete examples.

(2) The class $\mathcal{PAT}'$ is neighbor-refutably inferable from complete examples.

(3) The class $\mathcal{PAT}'$ is neighborly inferable from complete examples.

4 EFS Definable Classes 

In this section, we consider neighbor inference and neighbor-refutable inference of language classes defined by elementary formal systems (EFSs, for short).

EFSs were originally introduced by Smullyan [23] to develop his recursion theory. In a word, EFSs are a kind of logic programming language which uses patterns instead of terms in first order logic [25], and they are shown to be natural devices to define languages [3].

In this paper, we briefly review the related known results and the obtained results on neighbor inferability and neighbor-refutable inferability of language classes definable by the so-called length-bounded EFSs. For detailed definitions and properties of EFSs, please refer to Smullyan [23], Arikawa [3], Arikawa et al. [4,5] and Yamamoto [25].

For $n \in N$, let $\mathcal{LBL}^{[\le n]}$ be the class of languages defined by length-bounded EFSs with at most n axioms.

We note that the class $\mathcal{LBL}^{[\le n]}$ was introduced by Shinohara [22] as a rich hypothesis space inferable in the limit from positive examples. On refutable inferability, Mukouchi and Arikawa [15,17] have obtained the following theorem:

Theorem 10 (Mukouchi and Arikawa [15,17]). Let $n \in N$.

The class $\mathcal{LBL}^{[\le n]}$ is refutably inferable from complete examples.

We can show the following lemma in a similar way to Mukouchi and Arikawa [15,17].

Lemma 4. Let $n \in N$.

The class $\mathcal{LBL}^{[\le n]}$ satisfies the condition (C5”), that is, for every distance d, every $k \in N$ and every $L \notin [\mathcal{LBL}^{[\le n]}]^{(d,k)}$, there is a pair of definite finite tell-tale sets of L within the class $[\mathcal{LBL}^{[\le n]}]^{(d,k)}$.

By Theorems 0 H] and [TU] Corollaries [21 and [21 and Lemma (H we have the 
following theorem: 
Theorem 11. Let $n \in N$.

(1) The class $\mathcal{LBL}^{[\le n]}$ is weak neighbor-refutably inferable from complete examples.

(2) The class $\mathcal{LBL}^{[\le n]}$ is neighbor-refutably inferable from complete examples.

(3) The class $\mathcal{LBL}^{[\le n]}$ is neighborly inferable from complete examples.

5 Concluding Remarks 

We have introduced a notion of a k-neighbor language and formalized k-neighbor refutability and inferability of a language class from complete examples. Then we presented some sufficient and necessary conditions for a language class to be inferable in these senses. We also showed that the language class definable by the length-bounded EFSs with at most n axioms is k-neighbor-refutably inferable from complete examples.

As future work, we should clarify the relations between weak k-neighbor-refutable inferability, k'-neighbor-refutable inferability and k''-neighbor inferability for distinct $k, k', k'' \in N$. As another future investigation, we can consider neighbor inferability from positive examples and neighbor finite inferability from positive examples as well as from complete examples. We will discuss these issues elsewhere.
Abstract. Learning of recursive functions refutably means that for every recursive function, the learning machine has either to learn this function or to refute it, i.e., to signal that it is not able to learn it. Three modes of making precise the notion of refuting are considered. We show that the corresponding types of learning refutably are of strictly increasing power, where already the most stringent of them turns out to be of remarkable topological and algorithmical richness. All these types are closed under union, though in different strengths. Also, these types are shown to be different with respect to their intrinsic complexity; two of them do not contain function classes that are “most difficult” to learn, while the third one does. Moreover, we present characterizations for these types of learning refutably. Some of these characterizations make clear where the refuting ability of the corresponding learning machines comes from and how it can be realized, in general.

For learning with anomalies refutably, we show that several results from standard learning without refutation stand refutably. Then we derive hierarchies for refutable learning. Finally, we show that stricter refutability constraints cannot be traded for more liberal learning criteria.



1 Introduction 

The basic scenario in learning theory informally consists in a learning machine having to learn some unknown object based on certain information; that is, the machine creates one or more hypotheses which eventually converge to a more or less correct and complete description of the object. In learning refutably the main goal is more involved. Here, for every object from a given universe, the

* A full version of this paper is available as technical report (cf. [17]). 

** Supported in part by NUS grant number RP3992710. 



N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 283—298, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 





learning machine has either to learn the object or to refute it, that is, to “signal” if it is incapable of learning this object. This approach is philosophically motivated by Popper’s logic of scientific discovery (testability, falsifiability, refutability of scientific hypotheses), see [31,24]. Moreover, this approach has also some rather practical implications. If the learning machine signals its inability to learn a certain object, then one can react upon this inability by modifying the machine, by changing the hypothesis space, or by weakening the learning requirements.

A crucial point of learning refutably is to formally define how the machine is 
allowed or required to refute a non-learnable object. Mukouchi and Arikawa [29] required refuting to be done in a “one shot” manner, i.e., if after some finite
amount of time, the machine concludes that it cannot learn the target object, 
then it outputs a special “refuting symbol” and stops the learning process for- 
ever. Two weaker possibilities of refuting are based on the following observation. 
Suppose that at some time, the machine feels unable to learn the target object 
and outputs the refuting symbol. Nevertheless, this time the machine keeps try- 
ing to learn the target. It may happen that the information it further receives 
contains new evidence causing it to change its mind about its inability to learn 
the object. This process of “alternations” can repeat. It may end in learning the 
object. Or it may end in refuting it by never revising the machine’s belief that 
it cannot learn the object, i.e., by forever outputting the refuting symbol from 
some point on. Finally, there may be infinitely many such alternations between 
trying to learn and believing that this is impossible. In our paper, we will allow 
and study all three of these modes of learning refutably. 

Our universe is the class $\mathcal{R}$ of all recursive functions. The basic learning criterion used is Ex, learning in the limit (cf. Definition 1). We study the following types of learning refutably:

RefEx, where refuting a non-learnable function takes place in the one shot manner described above (cf. Definition 5).

WRefEx, where both learning and refuting are limiting processes, that is, on every function from the universe, the learning machine converges either to a correct hypothesis for this function or to the refuting symbol (cf. Definition 6); W stands for “weak”.

RelEx, where a function is considered to be refuted if the learner outputs the refuting symbol infinitely often on this function (cf. Definition 7). Rel stands for “reliable”, since RelEx coincides with reliable learning (cf. Proposition 1).

Note that for all types of learning refutably, every function from $\mathcal{R}$ is either learned or refuted by every machine learning refutably. So, it cannot happen that such a machine converges to an incorrect hypothesis (cf. Correctness Lemma).

We show that the types of learning refutably are of strictly increasing power (cf. Theorem 3). Already the most stringent of them, RefEx, is of remarkable topological and algorithmical richness (cf. Proposition 3 and Corollary 9). All of these learning types are closed under union (cf. Proposition 5), where RefEx and WRefEx, on the one hand, and RelEx, on the other hand, do not behave completely analogously. Such a difference can also be exhibited with respect to the intrinsic complexity; actually, both RefEx and WRefEx do not contain function classes that are “most difficult” to learn, while RelEx does contain such classes (cf. Theorems 6 and 7). We also present characterizations for our types of learning refutably. Some of these characterizations make it clear where the refuting ability of the corresponding learning machines comes from and how it can be realized, in general (cf. Theorems 12 and 13).

Besides pure Ex-learning refutably we also consider Ex-learning and Bc-learning with anomalies refutably (cf. Definitions 18 and 19). We show that many results from learning without refutation stand refutably, see Theorems 15 and 21. Then we derive several hierarchies for refutable learning, thereby solving an open problem from [22], see Corollaries 16 and 22. Finally, we show that, in general, one cannot trade a stricter refutability constraint for a more liberal learning criterion (cf. Corollary 25 and Theorem 26).

Since the pioneering paper [29] learning with refutation has attracted much attention (cf. [30,24,16,28,19,15]).

2 Notation and Preliminaries 

Unspecified notations follow [33]. N denotes the set of natural numbers. We write $\emptyset$ for the empty set and card(S) for the cardinality of the set S. The minimum and maximum of a set S are denoted by min(S) and max(S), respectively.

$\eta$, with or without decorations, ranges over partial functions. If $\eta_1$ and $\eta_2$ are both undefined on input x, then we take $\eta_1(x) = \eta_2(x)$. We say that $\eta_1 \subseteq \eta_2$ iff for all x in the domain of $\eta_1$, $\eta_1(x) = \eta_2(x)$. We let dom($\eta$) and rng($\eta$), respectively, denote the domain and range of the partial function $\eta$. $\eta(x){\downarrow}$ and $\eta(x) = {\downarrow}$ both denote that $\eta(x)$ is defined, and $\eta(x){\uparrow}$ as well as $\eta(x) = {\uparrow}$ stand for $\eta(x)$ is undefined. For any partial functions $\eta$, $\eta'$ and $a \in N$, we write $\eta =^a \eta'$ and $\eta =^* \eta'$ iff card($\{x \mid \eta(x) \ne \eta'(x)\}$) $\le a$ and card($\{x \mid \eta(x) \ne \eta'(x)\}$) $< \infty$, respectively. We identify a partial function $\eta$ with its graph $\{(x, \eta(x)) \mid x \in \mathrm{dom}(\eta)\}$.

For $r \in N$, the r-extension of $\eta$ denotes the function f defined as $f(x) = \eta(x)$, if $x \in \mathrm{dom}(\eta)$ and $f(x) = r$, otherwise.
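
The relation $=^a$ and the r-extension can be tried out on partial functions with finite domain. In the sketch below such a function is represented as a Python dictionary; this representation and the function names are assumptions made for illustration only.

```python
def disagreements(eta, eta2):
    """Arguments on which two finite-domain partial functions differ.

    A point where exactly one function is defined counts as a disagreement;
    points where both are undefined count as agreement.
    """
    return {x for x in eta.keys() | eta2.keys() if eta.get(x) != eta2.get(x)}


def equal_up_to(eta, eta2, a):
    """eta =^a eta2: the two functions differ on at most a arguments."""
    return len(disagreements(eta, eta2)) <= a


def r_extension(eta, r):
    """Total function that agrees with eta on dom(eta) and returns r elsewhere."""
    return lambda x: eta.get(x, r)


eta = {0: 1, 1: 2}
eta2 = {0: 1, 1: 3, 2: 0}
```

Here `eta` and `eta2` differ exactly on the arguments 1 and 2, so they are equal up to 2 anomalies but not up to 1.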

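$\mathcal{R}$ denotes the class of all recursive functions over N. Furthermore, we set $\mathcal{R}_{0,1} = \{f \mid f \in \mathcal{R} \ \&\ \mathrm{rng}(f) \subseteq \{0,1\}\}$. $\mathcal{C}$ and $\mathcal{S}$, with or without decorations, range over subsets of $\mathcal{R}$. For $\mathcal{C} \subseteq \mathcal{R}$, we let $\overline{\mathcal{C}}$ denote $\mathcal{R} \setminus \mathcal{C}$. By $\mathcal{P}$ we denote the class of all partial recursive functions over N. f, g, h and F, with or without decorations, range over recursive functions unless otherwise specified.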

A computable numbering (or just numbering) is a partial recursive function of two arguments. For a numbering $\psi(\cdot,\cdot)$, we use $\psi_i$ to denote the function $\lambda x.\psi(i,x)$, i.e., $\psi_i$ is the function computed by the program i in the numbering $\psi$. $\psi$ and $\varrho$ range over numberings. $\mathcal{P}_\psi$ denotes the set of partial recursive functions in the numbering $\psi$, i.e., $\mathcal{P}_\psi = \{\psi_i \mid i \in N\}$ and $\mathcal{R}_\psi = \{\psi_i \mid i \in N \ \&\ \psi_i \in \mathcal{R}\}$. That is, $\mathcal{R}_\psi$ stands for the set of all recursive functions in the numbering $\psi$. A numbering $\psi$ is called one-to-one iff $\psi_i \ne \psi_j$ for any distinct i, j. By $\varphi$ we denote a fixed acceptable programming system (cf. [33]). We write $\varphi_i$ for the partial recursive function computed by program i in the $\varphi$-system. By $\Phi$ we denote any Blum [6] complexity measure associated with $\varphi$. We assume without loss of generality that $\Phi_i(x) \ge x$, for all i, x.

$\mathcal{C} \subseteq \mathcal{R}$ is said to be recursively enumerable (abbr. r.e.) iff there is an r.e. set X such that $\mathcal{C} = \{\varphi_i \mid i \in X\}$. For any r.e. class $\mathcal{C} \ne \emptyset$, there is an $f \in \mathcal{R}$ such that $\mathcal{C} = \{\varphi_{f(i)} \mid i \in N\}$.

A function g is called accumulation point of a class $\mathcal{C} \subseteq \mathcal{R}$ iff $g \in \mathcal{R}$ and $(\forall n \in N)(\exists f \in \mathcal{C})[(\forall x \le n)[g(x) = f(x)] \ \&\ f \ne g]$. Note that g may or may not belong to $\mathcal{C}$. For $\mathcal{C} \subseteq \mathcal{R}$, we let Acc($\mathcal{C}$) = $\{g \mid g$ is an accumulation point of $\mathcal{C}\}$.

The quantifier $\forall^\infty$ stands for all but finitely many. The following function and class are considered below. Zero is the everywhere 0 function, and FINSUP = $\{f \mid f \in \mathcal{R} \ \&\ (\forall^\infty x)[f(x) = 0]\}$ is the class of all functions of finite support.



2.1 Function Learning 

We assume that the graph of a function is fed to a machine in canonical order. For a partial function $\eta$ with $\eta(x){\downarrow}$ for all $x < n$, we write $\eta[n]$ for the set $\{(x, \eta(x)) \mid x < n\}$, the finite initial segment of $\eta$ of length n. We set SEG = $\{f[n] \mid f \in \mathcal{R} \ \&\ n \in N\}$ and SEG$_{0,1}$ = $\{f[n] \mid f \in \mathcal{R}_{0,1} \ \&\ n \in N\}$. We let $\sigma$, $\tau$ and $\gamma$, with or without decorations, range over SEG. $\Lambda$ is the empty segment. We assume a computable ordering of the elements of SEG.

Let $|\sigma|$ denote the length of $\sigma$. Thus, $|f[n]| = n$, for every total function f and all $n \in N$. If $|\sigma| \ge n$, then we let $\sigma[n]$ denote $\{(x, \sigma(x)) \mid x < n\}$. An inductive inference machine (IIM) M is an algorithmic device that computes a total mapping from SEG into N (cf. [13]). We say that M(f) converges to i (written: $M(f){\downarrow} = i$) iff $(\forall^\infty n)[M(f[n]) = i]$; M(f) is undefined if no such i exists. Now, we define several criteria of function learning.

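Definition 1 ([13,5,10]). Let $a \in N \cup \{*\}$, let $f \in \mathcal{R}$ and let M be an IIM.

(a) M Ex$^a$-learns f (abbr. $f \in$ Ex$^a$(M)) iff there is an i with $M(f){\downarrow} = i$ and $\varphi_i =^a f$.

(b) M Ex$^a$-learns $\mathcal{C}$ iff M Ex$^a$-learns each $f \in \mathcal{C}$.

(c) Ex$^a$ = $\{\mathcal{C} \subseteq \mathcal{R} \mid (\exists M)[\mathcal{C} \subseteq$ Ex$^a(M)]\}$.

Note that for a = 0 we omit the upper index, i.e., we set Ex = Ex$^0$.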

By the definition of convergence, only finitely many data of / were seen by 
an IIM up to the (unknown) point of convergence. Hence, some learning must 
have taken place. Thus, we use identify, learn and infer interchangeably. 
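
As an illustration of Ex-learning, the following sketch runs a learner on growing initial segments f[n] of a target function. It uses identification by enumeration over a small finite list of total functions standing in for programs; this particular learner, the list `programs`, and the fallback index are assumptions of the sketch, not constructions from the paper.

```python
def initial_segment(f, n):
    """f[n]: the graph of f restricted to the arguments 0, ..., n - 1."""
    return [(x, f(x)) for x in range(n)]


def enumeration_learner(segment, programs):
    """Return the least index whose function agrees with every pair in the segment."""
    for i, p in enumerate(programs):
        if all(p(x) == y for x, y in segment):
            return i
    return len(programs)  # hypothetical fallback when nothing in the list fits


# Hypothetical hypothesis space: three total functions playing the role of programs.
programs = [lambda x: 0, lambda x: x, lambda x: x * x]
target = lambda x: x * x
guesses = [enumeration_learner(initial_segment(target, n), programs) for n in range(6)]
```

The guesses change only finitely often and then stay at index 2, a correct program for the target; this eventual stabilization on a correct index is what convergence in Definition 1 asks for.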

Definition 2 ([2,10]). Let $a \in N \cup \{*\}$, let $f \in \mathcal{R}$ and let M be an IIM.

(a) M Bc$^a$-learns f (written: $f \in$ Bc$^a$(M)) iff $(\forall^\infty n)[\varphi_{M(f[n])} =^a f]$.

(b) M Bc$^a$-learns $\mathcal{C}$ iff M Bc$^a$-learns each $f \in \mathcal{C}$.

(c) Bc$^a$ = $\{\mathcal{C} \subseteq \mathcal{R} \mid (\exists M)[\mathcal{C} \subseteq$ Bc$^a(M)]\}$.

We set Bc = Bc$^0$. Harrington [10] showed that $\mathcal{R} \in$ Bc$^*$. Thus, we shall consider mainly Bc$^a$ for $a \in N$ in the following.

Definition 3 (Minicozzi [27], Blum and Blum [5]). Let M be an IIM.

(a) M is reliable iff for all $f \in \mathcal{R}$, $M(f){\downarrow} \Rightarrow$ M Ex-identifies f.

(b) M RelEx-infers $\mathcal{C}$ (written: $\mathcal{C} \subseteq$ RelEx(M)) iff M is reliable and M Ex-infers $\mathcal{C}$.

(c) RelEx = $\{\mathcal{C} \subseteq \mathcal{R} \mid (\exists M)[M$ RelEx-infers $\mathcal{C}]\}$.

Thus, a machine is reliable if it does not converge on functions it fails to identify. For references on reliable learning besides [27,5], see [21,14,22,8].

Definition 4. NUM = $\{\mathcal{C} \mid (\exists \mathcal{C}' \mid \mathcal{C} \subseteq \mathcal{C}' \subseteq \mathcal{R})[\mathcal{C}'$ is recursively enumerable$]\}$.

Inductive inference within NUM has been studied, e.g. in [13,3]. For the general theory of learning recursive functions, see [1,5,10,11,23,18].



2.2 Learning Refutably 

Next, we introduce learning with refutation. We consider three versions of refutation based on how the machine is required to refute a function. First we extend the definition of IIM by allowing it to output a special symbol $\bot$. Thus, now an IIM maps SEG to $N \cup \{\bot\}$. Convergence of an IIM on a function is defined as before (but now a machine may converge to a number $i \in N$ or to $\bot$).

Definition 5. Let M be an IIM. M RefEx-identifies a class $\mathcal{C}$ (written: $\mathcal{C} \subseteq$ RefEx(M)) iff the following conditions are satisfied.

(a) $\mathcal{C} \subseteq$ Ex(M).

(b) For all $f \in$ Ex(M), for all n, $M(f[n]) \ne \bot$.

(c) For all $f \in \mathcal{R}$ such that $f \notin$ Ex(M), there exists an $n \in N$ such that $(\forall m < n)[M(f[m]) \ne \bot]$ and $(\forall m \ge n)[M(f[m]) = \bot]$.

The following generalization of RefEx places a less restrictive constraint on
how the machine refutes a function. WRef below stands for weak refutation.

Definition 6. Let M be an IIM. M WRefEx-learns a class $\mathcal{C}$ (written:
$\mathcal{C} \subseteq \mathrm{WRefEx}(M)$) iff the following conditions are satisfied.

(a) $\mathcal{C} \subseteq \mathrm{Ex}(M)$.

(b) For all $f \in \mathcal{R}$ such that $f \notin \mathrm{Ex}(M)$, $M(f)\downarrow = \bot$.

For weakly refuting a function $f$, an IIM just needs to converge to $\bot$. Before
convergence, it may change its mind finitely often whether or not to refute $f$.
Another way an IIM may refute a function $f$ is to output $\bot$ on $f$ infinitely often.

Definition 7. Let M be an IIM. M RelEx'-identifies a class $\mathcal{C}$ (written:
$\mathcal{C} \subseteq \mathrm{RelEx}'(M)$) iff the following conditions are satisfied.

(a) $\mathcal{C} \subseteq \mathrm{Ex}(M)$.

(b) For all $f \in \mathcal{R}$ such that $f \notin \mathrm{Ex}(M)$, there exist infinitely many $n \in \mathbb{N}$
such that $M(f[n]) = \bot$.

Proposition 1. RelEx = RelEx'. 

As follows from their definitions, for any of the learning types RefEx,
WRefEx and RelEx, we get that any $f \in \mathcal{R}$ has either to be learned or to be
refuted. This is made formally precise by the following Correctness Lemma.

Lemma 1 (Correctness Lemma). Let $I \in \{\mathrm{RefEx}, \mathrm{WRefEx}, \mathrm{RelEx}\}$. For
any $\mathcal{C} \subseteq \mathcal{R}$, any IIM M with $\mathcal{C} \subseteq I(M)$, and any $f \in \mathcal{R}$, if $M(f)\downarrow \in \mathbb{N}$, then
$\varphi_{M(f)} = f$.

3 Ex-Learning Refutably 

We first derive several properties of the defined types of learning refutably. We 
then relate these types by their so-called intrinsic complexity. Finally, we present 
several characterizations for refutable learnability. 



3.1 Properties and Relations 

First, we exhibit some properties of refutably learnable classes. These properties 
imply that the corresponding learning types are of strictly increasing power. 
Already the most stringent of these types, RefEx, is of surprising richness. 
In particular, every class from RefEx can be enriched by including all of its 
accumulation points. This is not possible for the classes from WRefEx and 
RelEx, as it follows from the proof of Theorem 3. 

Proposition 2. For all $\mathcal{C} \in \mathrm{RefEx}$, $\mathcal{C} \cup \mathrm{Acc}(\mathcal{C}) \in \mathrm{RefEx}$.

Proof. Suppose $\mathcal{C} \in \mathrm{RefEx}$ as witnessed by some total IIM M. Let $g \in \mathcal{R}$ be an
accumulation point of $\mathcal{C}$. We claim that M must Ex-identify $g$. Assume to the
contrary that for some $n$, $M(g[n]) = \bot$. Then, by the definition of accumulation
point, there is a function $f \in \mathcal{C}$ such that $g[n] \subseteq f$. Hence $M(f[n]) = \bot$, too, a
contradiction to M RefEx-identifying $\mathcal{C}$. ∎

The next proposition shows that RefEx contains “topologically rich”, 
namely non-discrete classes, i.e. classes which contain accumulation points. Thus, 
RefEx is “richer” than Ex-learning without any mind change, since any class be- 
ing learnable in that latter sense may not contain any of its accumulation points 
(cf. [25]). More precisely, RefEx and Ex-learning without mind changes are set- 
theoretically incomparable; the missing direction follows from Theorem 14 below. 

Proposition 3. RefEx contains non-discrete classes. 

The following proposition establishes some bound on the topological richness 
of the classes from WRefEx. 
Definition 8. A class $\mathcal{C} \subseteq \mathcal{R}$ is called initially complete iff for every $\sigma \in \mathrm{SEG}$,
there is a function $f \in \mathcal{C}$ such that $\sigma \subseteq f$.

Proposition 4. WRefEx does not contain any initially complete class. 

The following result is needed for proving Theorem 3 below. 

Lemma 2. $\mathcal{C} = \{f \in \mathcal{R} \mid (\forall x \in \mathbb{N})[f(x) \neq 0]\} \notin \mathrm{Ex}$.

We are now ready to prove that RefEx, WRefEx and RelEx, respectively, 
are of strictly increasing power. 

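Theorem 3. $\mathrm{RefEx} \subset \mathrm{WRefEx} \subset \mathrm{RelEx}$.

Proof. $\mathrm{RefEx} \subseteq \mathrm{WRefEx} \subseteq \mathrm{RelEx}$ by their definitions and Proposition 1.

We first show that $\mathrm{WRefEx} \setminus \mathrm{RefEx} \neq \emptyset$. For that purpose, we define
$\mathrm{SEG}^+ = \{f[n] \mid f \in \mathcal{R} \ \&\ n \in \mathbb{N} \ \&\ (\forall x \in \mathbb{N})[f(x) \neq 0]\}$. Let
$\mathcal{C} = \{0\text{-ext}(\sigma) \mid \sigma \in \mathrm{SEG}^+\}$. Then $\mathrm{Acc}(\mathcal{C}) = \{f \in \mathcal{R} \mid (\forall x \in \mathbb{N})[f(x) \neq 0]\}$,
which is not in Ex, by Lemma 2. Thus, $\mathcal{C} \cup \mathrm{Acc}(\mathcal{C}) \notin \mathrm{Ex}$, and hence,
$\mathcal{C} \notin \mathrm{RefEx}$, by Proposition 2.

In order to show that $\mathcal{C} \in \mathrm{WRefEx}$, let $\mathrm{prog} \in \mathcal{R}$ be a recursive function
such that for any $\sigma \in \mathrm{SEG}^+$, $\mathrm{prog}(\sigma)$ is a $\varphi$-program for $0\text{-ext}(\sigma)$. Let M be
defined as follows.

$$M(f[n]) = \begin{cases} \bot, & \text{if } f[n] \in \mathrm{SEG}^+; \\ \mathrm{prog}(\sigma), & \text{if } 0\text{-ext}(f[n]) = 0\text{-ext}(\sigma), \text{ for some } \sigma \in \mathrm{SEG}^+; \\ \bot, & \text{otherwise.} \end{cases}$$

It is easy to verify that M WRefEx-identifies $\mathcal{C}$.

We now show that $\mathrm{RelEx} \setminus \mathrm{WRefEx} \neq \emptyset$. FINSUP is initially complete and
$\mathrm{FINSUP} \in \mathrm{NUM}$. Since $\mathrm{NUM} \subseteq \mathrm{RelEx}$, see [27], we have that $\mathrm{FINSUP} \in \mathrm{RelEx}$.
On the other hand, $\mathrm{FINSUP} \notin \mathrm{WRefEx}$ by Proposition 4. ∎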
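
The machine M from the proof of Theorem 3 can be sketched on finite segments as follows. This is only an illustration under stated assumptions: a segment f[n] is modelled as a tuple of natural numbers, `BOT` stands in for the refutation symbol, the segment σ is taken to be nonempty, and `prog` is a hypothetical placeholder that returns a description of 0-ext(σ) rather than a genuine program index.

```python
BOT = "<refute>"  # stand-in for the refutation symbol (an assumption of this sketch)


def strip_trailing_zeros(seg):
    """Return the longest prefix of seg that does not end in 0."""
    end = len(seg)
    while end > 0 and seg[end - 1] == 0:
        end -= 1
    return seg[:end]


def prog(sigma):
    """Hypothetical placeholder: a description of 0-ext(sigma), not a real index."""
    return ("0-ext", tuple(sigma))


def M(seg):
    """Hypothesis of the weakly refuting machine on the segment seg."""
    # Case 1: no zero seen so far, so seg lies in SEG+ and M refutes for now.
    if all(v != 0 for v in seg):
        return BOT
    # Case 2: seg is a zero-free (nonempty, by assumption) prefix followed by zeros.
    sigma = strip_trailing_zeros(seg)
    if sigma and all(v != 0 for v in sigma):
        return prog(sigma)
    # Case 3: a nonzero value occurs after a zero; no member of the class fits.
    return BOT
```

On a function without zeros the sketch outputs the refutation symbol on every segment, so it converges to refutation, while on 0-ext(σ) it eventually stabilizes on `prog(σ)`; the possible switch from refuting to a hypothesis is what separates weak refutation from RefEx.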

As a consequence from the proof of Theorem 3, we can derive that the types 
RefEx, WRefEx and RelEx already differ on recursively enumerable classes. 

Corollary 4. $\mathrm{RefEx} \cap \mathrm{NUM} \subset \mathrm{WRefEx} \cap \mathrm{NUM} \subset \mathrm{RelEx} \cap \mathrm{NUM}$.

We next point out that all the types of learning refutably share a pretty rare, 
but desirable property, namely to be closed under union. 

Proposition 5. RefEx, WRefEx and RelEx are closed under finite union. 

RelEx is even closed under the union of any effectively given infinite
sequence of classes (cf. [27]). The latter is true for neither RefEx nor WRefEx,
as can be seen by shattering the class FINSUP into its subclasses of one
element each.



3.2 Intrinsic Complexity 

There is another field where RefEx and WRefEx, on the one hand, and RelEx, 
on the other hand, behave differently, namely that of intrinsic complexity. The 
intrinsic complexity compares the difficulty of learning by using some reducibility
notion, see [12]. With every reducibility notion comes a notion of completeness. A 
function class is complete for some learning type I, if this class is “most difficult” 
to learn among all the classes from I. As we show, RefEx and WRefEx do not 
contain such complete classes, while RelEx does. 

Definition 9. A sequence $P = p_0, p_1, \ldots$ of natural numbers is called
Ex-admissible for $f \in \mathcal{R}$ iff $P$ converges to a program $p$ for $f$.

Definition 10 (Rogers [33]). A recursive operator is an effective total mapping,
$\Theta$, from (possibly partial) functions to (possibly partial) functions such that:

(a) For all functions $\eta, \eta'$, if $\eta \subseteq \eta'$ then $\Theta(\eta) \subseteq \Theta(\eta')$.

(b) For all $\eta$, if $(x,y) \in \Theta(\eta)$, then there is a finite function $\alpha \subseteq \eta$ such that
$(x,y) \in \Theta(\alpha)$.

(c) For all finite functions $\alpha$, one can effectively enumerate (in $\alpha$) all $(x,y) \in \Theta(\alpha)$.

For each recursive operator $\Theta$, we can effectively find a recursive operator $\Theta'$
such that

(d) for each finite function $\alpha$, $\Theta'(\alpha)$ is finite, and its canonical index can be
effectively determined from $\alpha$, and

(e) for all total functions $f$, $\Theta'(f) = \Theta(f)$.

This allows us to get a nice effective sequence of recursive operators.

Proposition 6. There exists an effective enumeration, $\Theta_0, \Theta_1, \ldots$ of recursive
operators satisfying condition (d) above such that, for all recursive operators $\Theta$,
there exists an $i \in \mathbb{N}$ satisfying $\Theta(f) = \Theta_i(f)$ for all total functions $f$.

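Definition 11 (Freivalds et al. [12]). Let $\mathcal{S}, \mathcal{C} \in \mathrm{Ex}$. Then $\mathcal{S}$ is called
Ex-reducible to $\mathcal{C}$ (written: $\mathcal{S} \leq_{\mathrm{Ex}} \mathcal{C}$) iff there exist two recursive operators $\Theta$ and
$\Xi$ such that for all $f \in \mathcal{S}$,

(a) $\Theta(f) \in \mathcal{C}$,

(b) for any Ex-admissible sequence $P$ for $\Theta(f)$, $\Xi(P)$ is Ex-admissible for $f$.

If $\mathcal{S}$ is Ex-reducible to $\mathcal{C}$, then $\mathcal{C}$ is at least as difficult to Ex-learn as $\mathcal{S}$
is. Indeed, if M Ex-learns $\mathcal{C}$, then $\mathcal{S}$ is Ex-learnable by an IIM that, on any
function $f \in \mathcal{S}$, outputs $\Xi(M(\Theta(f)))$.

Definition 12. Let $I$ be a learning type and $\mathcal{C} \subseteq \mathcal{R}$. $\mathcal{C}$ is called Ex-complete in
$I$ iff $\mathcal{C} \in I$, and for all $\mathcal{S} \in I$, $\mathcal{S} \leq_{\mathrm{Ex}} \mathcal{C}$.

Theorem 5. Let $\mathcal{C} \in \mathrm{WRefEx}$. Then there exists a class $\mathcal{S} \in \mathrm{RefEx}$ such that
$\mathcal{S} \not\leq_{\mathrm{Ex}} \mathcal{C}$.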

Theorem 5 immediately yields the following result. 

Theorem 6. (1) There is no Ex-complete class in RefEx. 

(2) There is no Ex-complete class in WRefEx. 

In contrast to Theorem 6, RelEx contains an Ex-complete class.

Theorem 7. There is an Ex-complete class in RelEx. 
3.3 Characterizations



We present several characterizations for RefEx, WRefEx and RelEx. The first
group of characterizations relates refutable learning to the established concept
of classification. The main goal in recursion theoretic classification can be
described as follows. Let some finite (or even infinite) family of function
classes be given. Then, for an arbitrary function from the union of all these
classes, one has to find out which of these classes the function belongs to, see
[4,37,35,34,9]. What we need in our characterization theorems below is
classification where only two classes are involved in the classification process,
more exactly, a class together with its complement; and semi-classification,
which is a weakening of classification. Note that the corresponding
characterizations using these kinds of classification are in a sense close to the
definitions of learning refutably. Nevertheless, these characterizations are
useful in that their characteristic conditions are easily testable, i.e., they
allow one to check whether or not a given class is learnable with refutation.

Let $\mathcal{R}_{0,?}$ be the class of all total computable functions mapping $\mathbb{N}$ into $\{0, ?\}$.

Definition 13. $\mathcal{S} \subseteq \mathcal{R}$ is finitely semi-classifiable iff there is $c \in \mathcal{R}_{0,?}$ such that

(a) for every $f \in \mathcal{S}$, there is an $n \in \mathbb{N}$ such that $c(f[n]) = 0$,

(b) for every $f \in \overline{\mathcal{S}}$ and for all $n \in \mathbb{N}$, $c(f[n]) = ?$.

Intuitively, a class $\mathcal{S} \subseteq \mathcal{R}$ is finitely semi-classifiable if for every $f \in \mathcal{S}$ after
some finite amount of time one finds out that $f \in \mathcal{S}$, whereas for every $f \in \overline{\mathcal{S}}$,
one finds out “nothing”.

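Theorem 8. For any $\mathcal{C} \subseteq \mathcal{R}$, $\mathcal{C} \in \mathrm{RefEx}$ iff $\mathcal{C}$ is contained in some class
$\mathcal{S} \in \mathrm{Ex}$ such that $\overline{\mathcal{S}}$ is finitely semi-classifiable.

Proof. Necessity. Suppose $\mathcal{C} \in \mathrm{RefEx}$ as witnessed by some total IIM M. Let
$\mathcal{S} = \mathrm{Ex}(M)$. Clearly, $\mathcal{C} \subseteq \mathcal{S}$. Furthermore, (i) for any $f \in \mathcal{S}$ and any $n \in \mathbb{N}$,
$M(f[n]) \neq \bot$, and (ii) for any $f \in \overline{\mathcal{S}}$, there is $n \in \mathbb{N}$ such that $M(f[n]) = \bot$.
Now define $c$ as follows.

$$c(f[n]) = \begin{cases} 0, & \text{if } M(f[n]) = \bot; \\ ?, & \text{if } M(f[n]) \neq \bot. \end{cases}$$

Clearly, $c \in \mathcal{R}_{0,?}$ and $\overline{\mathcal{S}}$ is finitely semi-classifiable by $c$.

Sufficiency. Suppose $\mathcal{C} \subseteq \mathcal{S} \subseteq \mathrm{Ex}(M)$, and $\overline{\mathcal{S}}$ is finitely semi-classifiable by
some $c \in \mathcal{R}_{0,?}$. Now define $M'$ as follows.

$$M'(f[n]) = \begin{cases} M(f[n]), & \text{if } c(f[n]) = ?; \\ \bot, & \text{if } c(f[x]) = 0, \text{ for some } x \leq n. \end{cases}$$

It is easy to verify that $M'$ RefEx-identifies $\mathcal{C}$. ∎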
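
Both directions of the proof of Theorem 8 are simple wrappers, which the following sketch illustrates. It rests on several assumptions that are not part of the paper: segments are tuples of natural numbers, the prefixes f[0], ..., f[n] of a segment are modelled as its nonempty initial slices, `BOT` and `UNKNOWN` are stand-ins for the refutation symbol and for “?”, and the refuting case of M' is read as taking priority whenever both cases apply. The function names are hypothetical.

```python
BOT = "<refute>"  # stand-in for the refutation symbol (assumption)
UNKNOWN = "?"     # the value '?' of a finite semi-classifier


def classifier_from_machine(M):
    """Necessity direction: c outputs 0 exactly where M refutes."""
    def c(seg):
        return 0 if M(seg) == BOT else UNKNOWN
    return c


def refuting_machine(M, c):
    """Sufficiency direction: follow M until c has output 0 on some prefix."""
    def M_prime(seg):
        # Once c has signalled 0 on any prefix, the refutation is kept forever.
        if any(c(seg[:k]) == 0 for k in range(1, len(seg) + 1)):
            return BOT
        return M(seg)
    return M_prime
```

Checking all prefixes, rather than only the current segment, is what makes the refutation permanent, as required by condition (c) of Definition 5.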



We can apply the characterization of RefEx above in order to show that
RefEx contains “non-trivial” classes. To this end, let

$$\mathcal{C} = \{f \mid f \in \mathcal{R} \ \&\ \varphi_{f(0)} = f \ \&\ (\forall x \in \mathbb{N})[\Phi_{f(0)}(x) \leq f(x+1)]\}.$$

Clearly, $\mathcal{C} \in \mathrm{Ex}$ and $\overline{\mathcal{C}}$ is finitely semi-classifiable. Hence, by Theorem 8, $\mathcal{C}$ is
RefEx-learnable. $\mathcal{C} \notin \mathrm{NUM}$ was shown in [38], Theorem 4.2. Hence, we get
the following corollary illustrating that RefEx contains “algorithmically rich”
classes, that is, classes not contained in any recursively enumerable class.

Corollary 9. $\mathrm{RefEx} \setminus \mathrm{NUM} \neq \emptyset$.

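We now characterize WRefEx. For this, we need the special case of
classification where the classes under consideration form a partition of $\mathcal{R}$.

Definition 14 ([37]). (1) Let $\mathcal{C}, \mathcal{S} \subseteq \mathcal{R}$, where $\mathcal{C} \cap \mathcal{S} = \emptyset$. $(\mathcal{C}, \mathcal{S})$ is called
classifiable iff there is $c \in \mathcal{R}_{0,1}$ such that for any $f \in \mathcal{C}$ and for almost all
$n \in \mathbb{N}$, $c(f[n]) = 0$; and for any $f \in \mathcal{S}$ and for almost all $n \in \mathbb{N}$, $c(f[n]) = 1$.

(2) A class $\mathcal{C} \subseteq \mathcal{R}$ is called classifiable iff $(\mathcal{C}, \overline{\mathcal{C}})$ is classifiable.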

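Theorem 10. For any $\mathcal{C} \subseteq \mathcal{R}$, $\mathcal{C} \in \mathrm{WRefEx}$ iff $\mathcal{C} \subseteq \mathcal{S}$ for a classifiable class
$\mathcal{S} \in \mathrm{Ex}$.

Proof. Necessity. Suppose $\mathcal{C} \in \mathrm{WRefEx}$ as witnessed by some total IIM M. Let
$\mathcal{S} = \mathrm{Ex}(M)$. Clearly, $\mathcal{C} \subseteq \mathcal{S}$ and $\mathcal{S} \in \mathrm{Ex}$. Now define $c$ as follows.

$$c(f[n]) = \begin{cases} 0, & \text{if } M(f[n]) \neq \bot; \\ 1, & \text{if } M(f[n]) = \bot. \end{cases}$$

Then, clearly, $\mathcal{S}$ is classifiable by $c$.

Sufficiency. Suppose $\mathcal{C} \subseteq \mathcal{S} \subseteq \mathrm{Ex}(M)$, and let $\mathcal{S}$ be classifiable by some
$c \in \mathcal{R}_{0,1}$. Then, define $M'$ as follows.

$$M'(f[n]) = \begin{cases} M(f[n]), & \text{if } c(f[n]) = 0; \\ \bot, & \text{if } c(f[n]) = 1. \end{cases}$$

Clearly, $M'$ witnesses that $\mathcal{C} \in \mathrm{WRefEx}$. ∎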



Finally, we give a characterization of RelEx in terms of semi-classifiability.

Definition 15 ([35]). $\mathcal{S} \subseteq \mathcal{R}$ is semi-classifiable iff there is $c \in \mathcal{R}_{0,?}$ such that

(a) for any $f \in \mathcal{S}$ and almost all $n \in \mathbb{N}$, $c(f[n]) = 0$,

(b) for any $f \in \overline{\mathcal{S}}$ and infinitely many $n \in \mathbb{N}$, $c(f[n]) = ?$.

Thus, a class $\mathcal{S}$ of recursive functions is semi-classifiable if for every function
$f \in \mathcal{S}$, one can find out in the limit that $f$ belongs to $\mathcal{S}$, while for any $g \in \mathcal{R} \setminus \mathcal{S}$
one is not required to know in the limit where this function $g$ comes from.

Theorem 11. For all $\mathcal{C} \subseteq \mathcal{R}$, $\mathcal{C} \in \mathrm{RelEx}$ iff $\mathcal{C} \subseteq \mathcal{S}$ for a semi-classifiable class
$\mathcal{S} \in \mathrm{Ex}$.

Proof. Necessity. Suppose $\mathcal{C} \in \mathrm{RelEx}$ by some total IIM M. Let $\mathcal{S} = \mathrm{Ex}(M)$.
Clearly, $\mathcal{C} \subseteq \mathcal{S}$. In order to show that $\mathcal{S}$ is semi-classifiable, define $c$ as follows.

$$c(f[n]) = \begin{cases} 0, & \text{if } n = 0 \text{ or } M(f[n-1]) = M(f[n]); \\ ?, & \text{if } n > 0 \text{ and } M(f[n-1]) \neq M(f[n]). \end{cases}$$

Now, for any $f \in \mathcal{S}$, $M(f)\downarrow$, and thus $c(f[n]) = 0$ for almost all $n \in \mathbb{N}$. On
the other hand, if $f \in \overline{\mathcal{S}}$ then $f \notin \mathrm{Ex}(M)$. Consequently, since M is reliable
and total, we have $M(f[n-1]) \neq M(f[n])$ for infinitely many $n \in \mathbb{N}$. Hence
$c(f[n]) = ?$ for infinitely many $n$. Thus, $\mathcal{S}$ is semi-classifiable by $c$.

Sufficiency. Suppose $\mathcal{C} \subseteq \mathcal{S} \subseteq \mathrm{Ex}(M)$. Suppose $\mathcal{S}$ is semi-classifiable by
some $c \in \mathcal{R}_{0,?}$. Define $M'$ as follows.






Now, for any $f \in \mathcal{S}$, for almost all $n$, $c(f[n]) = 0$. Hence $M'$ will Ex-learn $f$, since
M does so. If $f \in \overline{\mathcal{S}}$, then $c(f[n]) = ?$ for infinitely many $n$. Consequently, $M'$
diverges on $f$ caused by arbitrarily large outputs. Thus, $M'$ RelEx-learns $\mathcal{C}$. ∎
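
The two constructions in the proof of Theorem 11 can be sketched as follows. The classifier mirrors the necessity direction (it outputs “?” exactly at mind changes of M). For the sufficiency direction the sketch assumes one construction that is consistent with the “arbitrarily large outputs” argument: M' repeats M's hypothesis where c outputs 0 and outputs the current segment length where c outputs “?”. That choice, the modelling of segments as tuples, and all function names are assumptions of this illustration.

```python
UNKNOWN = "?"  # the value '?' of a semi-classifier


def mind_change_classifier(M):
    """Necessity direction: c signals '?' exactly when M changes its hypothesis."""
    def c(seg):
        # seg models f[n] and seg[:-1] models f[n-1]; a segment of length
        # at most 1 plays the role of the case n = 0 (assumption).
        if len(seg) <= 1 or M(seg[:-1]) == M(seg):
            return 0
        return UNKNOWN
    return c


def reliable_machine(M, c):
    """Sufficiency direction under the assumed construction described above."""
    def M_prime(seg):
        # Where c outputs '?', emit a value that grows with the segment length,
        # so infinitely many '?' signals prevent convergence.
        return M(seg) if c(seg) == 0 else len(seg)
    return M_prime
```

With this choice, infinitely many “?” signals force infinitely many distinct outputs, so M' cannot converge on a function outside the semi-classifiable class.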



There is a kind of “dualism” in the characterizations of RefEx and RelEx. 
A class is RefEx-learnable if it is contained in some Ex-learnable class having 
a complement that is finitely semi-classifiable. In contrast, a class is
RelEx-learnable if it is a subset of an Ex-learnable class that itself is semi-classifiable.

The characterizations of the second group, this time for RefEx and RelEx, 
significantly differ from the characterizations presented above in two points. 
First, the characteristic conditions are stated here in terms that formally have 
nothing to do with learning. Second, the sufficiency proofs are again constructive 
and they make clear where the “refuting ability” of the corresponding learning 
machines in general comes from. For stating the corresponding characterization 
of RefEx, we need the following notions. 

Definition 16. A numbering $\psi$ is strongly one-to-one iff there is a recursive
function $d: \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ such that for all $i, j \in \mathbb{N}$, $i \neq j$, there is an $x < d(i,j)$
with $\psi_i(x) \neq \psi_j(x)$.

Any strongly one-to-one numbering is one-to-one. Moreover, given any distinct
$\psi$-indices $i$ and $j$, the functions $\psi_i$ and $\psi_j$ do not only differ, but one can
compute a bound on the least argument on which these functions differ.

Definition 17 ([32]). A class $\Pi \subseteq \mathcal{P}$ is called completely r.e. iff $\{i \mid \varphi_i \in \Pi\}$
is recursively enumerable.

Now, we can present our next characterization.

Theorem 12. For any $\mathcal{C} \subseteq \mathcal{R}$, $\mathcal{C} \in \mathrm{RefEx}$ iff there are numberings $\psi$ and $\varrho$
such that

(1) $\psi$ is strongly one-to-one and $\mathcal{C} \subseteq \mathcal{P}_\psi$,

(2) $\mathcal{P}_\varrho$ is completely r.e. and $\mathcal{R}_\varrho = \overline{\mathcal{R}_\psi}$.

By the proof of Theorem 12, in RefEx-learning the processes of learning and
refuting, respectively, can be nicely separated. An IIM can be provided with two
spaces, one for learning, $\psi$, and one for refuting, $\varrho$. If and when the “search for
refutation” in the refutation space has been successful, the learning process can
be stopped forever. This search for refutation is based on the fact that the
refutation space forms a completely r.e. class $\mathcal{P}_\varrho$ of partial recursive functions. The
spaces for learning and refuting are interconnected by the essential property
that their recursive kernels, $\mathcal{R}_\psi$ and $\mathcal{R}_\varrho$, disjointly exhaust $\mathcal{R}$. This property
guarantees that each recursive function either will be learned or refuted. The
above characterization of RefEx is “more granular” than the one of RefEx by
Theorem 8. The characterization of Theorem 8 requires that one should find out
anyhow if the given function does not belong to the target class. The
characterization of Theorem 12 makes precise how this task can be done. Moreover,
the RefEx-characterization of Theorem 12 is incremental to a characterization
of Ex, since the existence of a numbering with condition (1) above is necessary
and sufficient for Ex-learning the class $\mathcal{C}$ (cf. [36]). Finally, the refutation space
could be “economized” in the same manner as the learning space by making it
one-to-one.

The following characterization of RelEx is a slight modification of a result 
from [20]. 

Theorem 13. For any $\mathcal{C} \subseteq \mathcal{R}$, $\mathcal{C} \in \mathrm{RelEx}$ iff there are a numbering $\psi$ and a
function $d \in \mathcal{R}$ such that

(1) for any $f \in \mathcal{R}$, if $H_f = \{i \mid f[d(i)] \subseteq \psi_i\}$ is finite, then $H_f$ contains a
$\psi$-index of $f$,

(2) for any $f \in \mathcal{C}$, $H_f$ is finite.

Theorem 13 instructively clarifies where the ability to learn reliably may come
from. Mainly, it comes from the properties of a well-chosen space of hypotheses.
In any such space $\psi$ exhibited by Theorem 13, for any function $f$ from the class
to be learned, there are only finitely many “candidates” for $\psi$-indices of $f$, the set
$H_f$. This finiteness of $H_f$, together with the fact that $H_f$ then contains a $\psi$-index
of $f$, makes sure that the amalgamation technique [10] succeeds in learning any
such $f$. Conversely, the infinity of this set $H_f$ of candidates automatically ensures
that the learning machine as defined in the sufficiency proof of Theorem 13
diverges on $f$. This is achieved by causing the corresponding machine to output
arbitrarily large hypotheses on every function $f \in \mathcal{R}$ with $H_f$ being infinite.



4 $\mathrm{Ex}^a$-Learning and $\mathrm{Bc}^a$-Learning Refutably

In this section, we consider Ex-learning and Bc-learning with anomalies 
refutably. Again, we will derive both strengths and weaknesses of refutable learn- 
ing. As it turns out, many results of standard learning, i.e. without refutation, 
stand refutably. This yields several hierarchies for refutable learning. Further- 
more, we show that in general one cannot trade the strictness of the refutability 
constraints for the liberality of the learning criteria. 

We can now define $I\mathrm{Ex}^a$ and $I\mathrm{Bc}^a$ for $I \in \{\mathrm{Ref}, \mathrm{WRef}, \mathrm{Rel}\}$ analogously
to Definitions 5, 6, and 7. We only give the definitions of $\mathrm{RefEx}^a$ and $\mathrm{RelBc}^a$
as examples.
Definition 18. Let $a \in \mathbb{N} \cup \{*\}$ and let M be an IIM. M $\mathrm{RefEx}^a$-learns $\mathcal{C}$ iff

(a) $\mathcal{C} \subseteq \mathrm{Ex}^a(M)$.

(b) For all $f \in \mathrm{Ex}^a(M)$, for all $n$, $M(f[n]) \neq \bot$.

(c) For all $f \in \mathcal{R}$ such that $f \notin \mathrm{Ex}^a(M)$, there exists an $n \in \mathbb{N}$ such that
$(\forall m < n)[M(f[m]) \neq \bot]$ and $(\forall m \geq n)[M(f[m]) = \bot]$.

Definition 19 ([22]). Let $a \in \mathbb{N} \cup \{*\}$ and let M be an IIM. M $\mathrm{RelBc}^a$-learns
$\mathcal{C}$ iff

(a) $\mathcal{C} \subseteq \mathrm{Bc}^a(M)$.

(b) For all $f \in \mathcal{R}$ such that $f \notin \mathrm{Bc}^a(M)$, there exist infinitely many $n \in \mathbb{N}$ such
that $M(f[n]) = \bot$.

$\mathrm{RelEx}^a$ and $\mathrm{RelBc}^a$ were first studied in [21] and [22], respectively.

Our first result points out some weakness of learning refutably. It shows that 
there are classes which, on the one hand, are easy to learn in the standard sense 
of Ex-learning without any mind change, but, on the other hand, which are 
not learnable refutably, even if we allow both the most liberal type of learning 
refutably, namely reliable learning, and the very rich type of Bc-learning with 
an arbitrarily large number of anomalies. For proving this result, we need the 
following proposition. 

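Proposition 7. (a) For any $a \in \mathbb{N}$ and any $\sigma \in \mathrm{SEG}$, $\{f \in \mathcal{R} \mid \sigma \subseteq f\} \notin \mathrm{Bc}^a$.
(b) For any $a \in \mathbb{N}$ and any $\sigma \in \mathrm{SEG}_{0,1}$, $\{f \in \mathcal{R}_{0,1} \mid \sigma \subseteq f\} \notin \mathrm{Bc}^a$.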

Next, recall that Ex-learning without mind changes is called finite learning. 
Informally, here the learning machine has “one shot” only to do its learning task. 
We denote the resulting learning type by Fin. 

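Theorem 14. For all $a \in \mathbb{N}$, $\mathrm{Fin} \setminus \mathrm{RelBc}^a \neq \emptyset$.

Next we show that allowing anomalies can help in learning refutably. Indeed,
while $\mathrm{Ex}^{a+1} \setminus \mathrm{Ex}^a \neq \emptyset$ was shown in [10], we now strengthen this result to
RefEx-learning with anomalies.

Theorem 15. For any $a \in \mathbb{N}$, $\mathrm{RefEx}^{a+1} \setminus \mathrm{Ex}^a \neq \emptyset$.

Theorem 15 implies the following hierarchy results ((3) was already shown
in [21]).

Corollary 16. For every $a \in \mathbb{N}$,

(1) $\mathrm{RefEx}^a \subset \mathrm{RefEx}^{a+1}$,

(2) $\mathrm{WRefEx}^a \subset \mathrm{WRefEx}^{a+1}$,

(3) $\mathrm{RelEx}^a \subset \mathrm{RelEx}^{a+1}$.

Now a proof similar to the proof of Theorem 15 can be used to show the
following result. Notice that $\mathrm{Ex}^* \setminus \bigcup_{a \in \mathbb{N}} \mathrm{Ex}^a \neq \emptyset$ was proved in [10].

Theorem 17. $\mathrm{RefEx}^* \setminus \bigcup_{a \in \mathbb{N}} \mathrm{Ex}^a \neq \emptyset$.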

Theorem 15 implies further corollaries. In [10], $\mathrm{Ex}^* \subseteq \mathrm{Bc}$ was shown. This
result extends to all our types of refutable learning.

Proposition 8. For $I \in \{\mathrm{Ref}, \mathrm{WRef}, \mathrm{Rel}\}$, $I\mathrm{Ex}^* \subseteq I\mathrm{Bc}$.

In [10] it was proved that $\mathrm{Bc} \setminus \mathrm{Ex}^* \neq \emptyset$. This result holds refutably.

Corollary 18. $\mathrm{RefBc} \setminus \mathrm{Ex}^* \neq \emptyset$.

The next corollary points out that already $\mathrm{RefEx}^1$ contains “algorithmically
rich” classes of predicates.

Corollary 19. $\mathrm{RefEx}^1 \cap 2^{\mathcal{R}_{0,1}} \not\subseteq \mathrm{NUM} \cap 2^{\mathcal{R}_{0,1}}$.

Corollary 19 can even be strengthened by replacing $\mathrm{RefEx}^1$ with RefEx.
This once more exhibits the richness of already the most stringent of our types
of learning refutably.

Theorem 20. $\mathrm{RefEx} \cap 2^{\mathcal{R}_{0,1}} \not\subseteq \mathrm{NUM} \cap 2^{\mathcal{R}_{0,1}}$.

Note that Theorem 20 contrasts with a known result on reliable Ex-learning. If
we require the Ex-learning machine's reliability not only on $\mathcal{R}$, but even on the
set of all total functions, then all classes of recursive predicates belonging to this
latter type are in NUM, see [14].

We now give the analogue to Theorem 15 for $\mathrm{Bc}^a$-learning rather than
$\mathrm{Ex}^a$-learning. Note that $\mathrm{Bc}^{a+1} \setminus \mathrm{Bc}^a \neq \emptyset$ was shown in [10].

Theorem 21. For any $a \in \mathbb{N}$, $\mathrm{RefBc}^{a+1} \setminus \mathrm{Bc}^a \neq \emptyset$.

Theorem 21 yields the following hierarchies, where (3) solves an open problem
from [22].

Corollary 22. For every $a \in \mathbb{N}$,

(1) $\mathrm{RefBc}^a \subset \mathrm{RefBc}^{a+1}$,

(2) $\mathrm{WRefBc}^a \subset \mathrm{WRefBc}^{a+1}$,

(3) $\mathrm{RelBc}^a \subset \mathrm{RelBc}^{a+1}$.

Theorem 23. $\mathrm{RefBc}^* \setminus \bigcup_{a \in \mathbb{N}} \mathrm{Bc}^a \neq \emptyset$.

In the proof of Theorem 3 we have derived that $\mathrm{FINSUP} \notin \mathrm{WRefEx}$. This
result is now strengthened for $\mathrm{WRefBc}^a$-learning and then used in the next
corollary below.

Theorem 24. For every $a \in \mathbb{N}$, $\mathrm{FINSUP} \notin \mathrm{WRefBc}^a$.

The next corollary points out the relative strength of RelEx-learning over
$\mathrm{WRefBc}^a$-learning. In other words, in general, one cannot compensate a stricter
refutability constraint by a more liberal learning criterion.

Corollary 25. For all $a \in \mathbb{N}$, $\mathrm{RelEx} \setminus \mathrm{WRefBc}^a \neq \emptyset$.

Our final result exhibits the strength of WRefEx-learning over
$\mathrm{RefBc}^a$-learning. Thus, it is in the same spirit as Corollary 25 above.

Theorem 26. For all $a \in \mathbb{N}$, $\mathrm{WRefEx} \setminus \mathrm{RefBc}^a \neq \emptyset$.

Note that Theorems 14, 24 and 26, and Corollary 25 hold even if we replace
$\mathrm{Bc}^a$ by any criterion of learning for which Proposition 7 holds.
Abstract. We consider, within the framework of inductive inference,
the concept of refuting learning as introduced by Mukouchi and Arikawa,
where the learner is not only required to learn all concepts in a given
class but also has to explicitly refute concepts outside the class.

In the first part of the paper, we consider learning from text and
introduce a concept of limit-refuting learning that is intermediate
between refuting learning and reliable learning. We give characterizations
for these concepts and show some results about their relative strength
and their relation to confident learning.

In the second part of the paper we consider learning from texts that
for some $k$ contain all positive $\Pi_k$-formulae that are valid in the standard
structure determined by the set to be learned. In this model, the
following results can be shown. For the language with successor, any countable
axiomatizable class can be limit-refuting learned from $\Pi_1$-texts. For the
language with successor and order, any countable axiomatizable class
can be reliably learned from $\Pi_1$-texts and can be limit-refuting learned
from $\Pi_2$-texts, whereas the axiomatizable class of all finite sets cannot
be limit-refuting learned from $\Pi_1$-texts. For the full language of
arithmetic, which contains in addition plus and times, for any $k$ there is an
axiomatizable class that can be limit-refuting learned from $\Pi_{k+1}$-texts
but not from $\Pi_k$-texts. A similar result with $k+3$ in place of $k+1$ holds
with respect to the language of Presburger's arithmetic.



1 Introduction 

Inductive Inference studies, on an abstract level, the phenomenon of
learning. Gold [7] introduced the following basic formalization of a learning
situation. The objects to be learned are the sets within a given class of recursively
enumerable sets. The learner has to identify each set in this class by converging
to a hypothesis that describes the set uniquely while observing longer and longer
prefixes of any text for this set. A learner converges if it changes its hypothesis at
most finitely often, a text for a set is any sequence that contains all elements but
no non-elements of the set, and usually hypotheses are indices with respect to
some fixed acceptable numbering of the partial recursive functions (equivalently
one could use grammars or programs enumerating the members of the set to
be learned). Gold [7] demonstrated that it is impossible to learn the class of all
recursively enumerable sets. This restriction holds for topological as well as for
recursion theoretical reasons.

(a) For any learner that learns all finite sets and for any infinite set A, there is 
a text for A on which the learner diverges. 

(b) The class of all graphs of computable functions cannot be learned by a 
computable learner — indeed, Adleman and Blum [1] quantified the problem 
of learning this class by showing that learning the class requires access to an 
oracle of high Turing degree. 

The topological and computational aspects of learning interact. Gold [7]
considered models of learning where in place of arbitrary texts, the learner just
receives texts that can be computed in some fixed computation model. Gold
showed that a computable learner can learn all recursively enumerable sets from
primitive recursive texts (by simply identifying the primitive recursive function
that generates the text) while, on the other hand, the collection of all recursive
texts is already so complex that a computable learner cannot learn the class of
all recursively enumerable sets from recursive texts.

In the present work, the power of learners is not enlarged by restricting
texts to computationally simple ones but by increasing their information
content. While standard texts essentially just list the elements of the set to be
learned, we consider texts that contain positive formulae that are true for the
set to be learned. The consideration of such more informative texts relates to
the fact that we consider a model of learning where the learner has to recognize
and to explicitly refute data-sequences belonging to sets that are not learned.
This model is rather restrictive in a setting of standard texts and allows just the
learning of classes of finite sets. The model becomes more powerful in the setting
where the texts contain formulae, and in this setting we will investigate
which kinds of classes can be learned from what types of formulae.
Mukouchi and Arikawa [22,23] introduced the learning model where the learner
has to refute data-sequences that belong to sets that are not in the class to
be learned. Their model is a sharpened version of Minicozzi's reliable learning,
where the learner either converges to a correct index or diverges. In the model of
Mukouchi and Arikawa, instead of diverging, the learner has to give an explicit
refutation signal after a finite number of steps.

Lange and Watson, Jantke, and Jain considered variants of
refuting learning where a learner $M$ for a class $\mathcal{C}$ of sets is not required to refute
unless the input text $T$ satisfies both of the following conditions.

— M does not infer the concept to which T belongs. 

— There is a prefix $\sigma \preceq T$ such that no data-sequence of any concept in $\mathcal{C}$
extends $\sigma$.

These restrictions were meant to overcome the observation that the original
definition of “refutable learning” of Mukouchi and Arikawa [22,23] was rather
restrictive. In particular, the original conditions permitted only to learn finite
sets since, on the one hand, a learner cannot learn an infinite set $A$ and all its
subsets and, on the other hand, a refuting learner cannot refute any subset of a
set it learns.

The present work takes an alternative approach to improve the power of a
refuting learner. First, we consider a model where the learner is only required to
“refute in the limit”. In a context of language learning from informant, a similar
concept has been introduced recently and independently by Jain, Kinber,
Wiehagen and Zeugmann. In a context of function-learning, Grieser already
investigated learners that refute in the limit. In his model of reflecting learning,
however, a function $f$ has only to be refuted in case it is incompatible with the
class to be learned, i.e., if there is a prefix $\sigma \preceq f$ that is not extended by any
function in the class to be learned. Grieser notes that with his model in
many (but not all) cases the necessity to refute can be avoided by transition
to a dense superclass $\mathcal{C}'$ of the class $\mathcal{C}$ to be learned because by definition, $\mathcal{C}'$ is
learnable with reflection in the limit iff $\mathcal{C}'$ is learnable in the limit with respect
to the standard definition of learning.

Second, more powerful variants of texts will be considered in order to over- 
come the restriction to classes of finite sets. This is achieved by considering a 
slightly altered form of the logical-based setting originally considered by
Mukouchi and Arikawa [22,23]. Informally, our approach can be summarized as
follows.

— The learner either has to converge to an index of the set to be learned or 
to the distinguished refutation symbol “?”. It will be shown in Remark 3.5
that this type of learning is more restrictive than reliable learning but is less 
restrictive than the model used by Mukouchi and Arikawa, where the data 
is already refuted by outputting a single refutation symbol. 

— The data-sequences are sequences of first-order sentences describing the set 
to be learned. In the special case where the data just contains the atomic 
facts that hold for the set to be learned, this is equivalent to presenting a 
standard text for the set. We will, however, also consider models where the 
data does not just contain atomic sentences but $\Pi_k$-sentences of some given
level k in the quantifier-alternation-hierarchy. 

More detailed accounts of inductive inference, inference in logic and of recursion
theory in general can be found in the monographs [3,12,19,24,30].

Notation. For an arbitrary set $A$, let $A^*$ be the set of finite strings over $A$.
We write $\mathbb{N}$ for the set of natural numbers. Unless explicitly stated otherwise,
by the terms set and class we refer to a set of natural numbers and to a set
of such sets, respectively. We fix a canonical indexing of the finite sets and we
let $F_i$ denote the finite set with canonical index $i$. A class $\mathcal{C}$ of finite sets is
computable (is recursively enumerable) iff the set $\{i : F_i \in \mathcal{C}\}$ is computable (is
recursively enumerable). Observe that a non-empty class of finite sets is
recursively enumerable iff it can be represented as $\{F_{g(i)} : i \in \mathbb{N}\}$ for some recursive
function $g$.
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2 Learning from Standard Texts 

Before we discuss refuting learning in Section 3, we briefly review some basic
concepts and techniques from learning theory. Related to the task of refuting
input texts, the learners considered in the following do not just output indices
(i.e., natural numbers) but might also output a special refutation symbol.

We fix two distinguished symbols not in ℕ, the pause symbol # and the
refutation symbol ?. Texts and strings are infinite and finite, respectively,
sequences over ℕ ∪ {#}. The range of a text or string is the set of all elements
appearing in it that are different from the pause symbol. We write range(σ) for
the range of a string σ. A text is a text for a set A iff A coincides with the
range of this text and hence, for example, ### ... is the only text for the
empty set.
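The objects introduced so far can be made concrete with a small sketch. It assumes the common binary coding of canonical indices, in which F_i contains exactly the positions of the 1-bits of i (the text only states that some canonical indexing is fixed), and it represents the pause symbol by the Python string "#". All function names are illustrative.

```python
PAUSE = "#"  # assumed representation of the pause symbol


def canonical_index(finite_set):
    """Canonical index under the assumed binary coding: sum of 2**x over the set."""
    return sum(1 << x for x in finite_set)


def finite_set(index):
    """Inverse of canonical_index: the finite set F_index."""
    return {x for x in range(index.bit_length()) if (index >> x) & 1}


def string_range(sigma):
    """Range of a string: all entries other than the pause symbol."""
    return {x for x in sigma if x != PAUSE}


sigma = [3, PAUSE, 0, 3, PAUSE]
print(string_range(sigma))                   # {0, 3}
print(canonical_index(string_range(sigma)))  # 9, since 2**0 + 2**3 = 9
print(finite_set(9))                         # {0, 3}
```

Under this coding the empty set has canonical index 0, and a string consisting only of pause symbols has empty range.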

Definition 2.1. An unrestricted learner is a mapping from strings to ℕ ∪ {?}
and a learner is such a mapping that is computable. A learner EX-learns or, for
short, learns a set A iff on every text for A, the learner converges to an index
for A. A learner learns a class iff it learns every set in the class.

Remark 2.2. The numbers output by a learner are meant as hypotheses on
the set to be learned with respect to some fixed indexing. In this connection, the
usage of computable and of recursively enumerable indices is most common, i.e.,
the number i denotes the i-th partial recursive function or the i-th recursively
enumerable set W_i. We frequently consider the learning of classes of finite sets,
and in this situation we might also use canonical indices as hypotheses.

Most of the results shown below will go through if we simply require that 
the sets to be learned can be identified by an index at all and that there is an 
effective mapping from natural descriptions of sets emerging during the learning
algorithm to indices of these sets with respect to the indexing used. In fact, we 
will not presuppose more on the indexing used unless explicitly stated otherwise. 

In a context of learners that always output a natural number, Osherson, Stob 
and Weinstein considered learners that converge on all texts. 

Definition 2.3. [25, Section 4.6.2] A learner is confident iff on any text, the
learner converges to a natural number. A class is confidently learnable iff it is
learned by a confident learner.

Standard techniques and results for confident learners extend easily to the type
of learners considered here, which besides natural numbers might also output
the refutation symbol, if we again require that the learner converges on all
texts, either to an index or to the refutation symbol.

Remark 2.4. Any class that is learned by a learner that converges on all texts 
does not contain infinite ascending chains. In particular, the class of all finite 
sets cannot be learned by such a learner. 

For a proof by contradiction, fix a learner M that converges on all texts and
consider an ascending chain A_0, A_1, ... of sets that are all learned by M. Then
one can inductively find strings σ_k such that each σ_k contains only elements
from A_k and M outputs an index for A_k on input σ_0 σ_1 ... σ_k. So M outputs on
the text σ_0 σ_1 ... an index for each of the sets A_k and hence does not converge,
neither to an index for a set nor to the refutation symbol. (It is not required that
σ_0 σ_1 ... is a text for the set A_0 ∪ A_1 ∪ ...; it may also be a text for a subset.)

Remark 2.5. Let M_0 be a learner that converges on all texts. Then for any
set A there is a string τ that is a stabilizing sequence for M_0 and A in the sense
that for all strings η over A ∪ {#} we have M_0(τη) = M_0(τ). For a proof observe
that, otherwise, we could construct a text for A on which M_0 diverges.

By searching for such stabilizing sequences we can construct a learner M
that learns all sets that are learned by M_0 and has in addition the following
properties (for details of this construction see Jain et al. [12, Proposition 5.29 on
page 102]). First, for any set A, including sets that are not learned or are not
even indexed by the given indexing, the learner M converges on every text of
A to the same value M_0(τ), where τ is the least stabilizing sequence for M_0 and
A (with respect to some appropriate ordering on strings). Second, any text for
any set A has a finite prefix that is a stabilizing sequence for M and A.

3 Refuting Learning from Standard Texts 

Next we review the definitions of the concepts of refuting and reliable learning 
that are due to Mukouchi and Arikawa and to Minicozzi [21], respectively,
and we introduce the related concept of limit-refuting learning. While a refuting 
learner continues forever to output refutation symbols after having output a 
refutation symbol once, a limit-refuting learner might alternate between indices 
and refutation symbols in an arbitrary way before converging. 

Definition 3.1. A learner refutes a set iff on every text for this set, the learner 
first outputs at most finitely many numbers (without outputting any refutation 
symbol) and then outputs nothing but refutation symbols. A learner is refuting 
iff for any set A, either the learner refutes A or on every text for A the learner 
converges to an index for A without ever outputting ?. A learner limit-refutes 
a set iff on every text for this set, the learner converges to ?, and a learner is 
limit-refuting if it either learns or limit-refutes any set. 

A class C is refuting learnable iff there is a refuting learner that learns C. A 
class C is sharply refuting learnable iff there is a refuting learner that learns C 
and refutes every set not in C. The concept limit-refuting learnable and its sharp 
variant are defined likewise with refuting replaced by limit-refuting. 

A learner is reliable iff for any set A, the learner either learns A or has 
infinitely many mind changes on any text for A, and a class is reliably learnable 
if it is learned by a reliable learner. 
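The difference between the two output patterns in Definition 3.1 can be illustrated on finite prefixes of a learner's output sequence. The following sketch is only an illustration with hypothetical names. A finite prefix can show that a learner violates the refuting pattern, but it can never establish convergence.

```python
REFUTE = "?"  # assumed representation of the refutation symbol


def respects_refuting_pattern(outputs):
    """True iff no index is output after a refutation symbol has appeared."""
    refuted = False
    for hypothesis in outputs:
        if hypothesis == REFUTE:
            refuted = True
        elif refuted:
            return False
    return True


# Finitely many indices followed by refutation symbols only: allowed for a
# refuting learner that refutes the presented set.
print(respects_refuting_pattern([7, 7, 3, REFUTE, REFUTE]))  # True

# An index after a refutation symbol: allowed for a limit-refuting learner
# before it converges, but not for a refuting learner.
print(respects_refuting_pattern([REFUTE, 3, REFUTE]))        # False
```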

The sharply refuting learnable classes are those originally introduced by
Mukouchi and Arikawa [22,23]. Observe that a class is refuting learnable iff it is
a subclass of some sharply refuting learnable class. This follows because by
definition any refuting learnable class is a subclass of a sharply refuting
learnable class while, on the other hand, any sharply refuting learnable class is
refuting learnable and any subclass of a refuting learnable class is again
refuting learnable. In Remarks 3.2 through 3.4 we describe some features of the
types of learning described in Definition 3.1 and then, in Remark 3.5, we compare
their respective strength.

Remark 3.2. For the scope of this remark, call a class infinitely-often-refuting
learnable iff the class is learned by a learner that for any set A, either learns A
or outputs infinitely many refutation symbols on any text for A. Then by
definition, any limit-refuting learnable class is also infinitely-often-refuting
learnable. Moreover, it is not hard to show that the concepts of
infinitely-often-refuting learning and of reliable learning coincide.

Remark 3.3. A class of sets is reliably learnable if and only if it consists only
of finite sets. In particular, by Remark 3.5 below, any refuting or limit-refuting
learnable class consists only of finite sets.

The restriction of reliable learning to classes of finite sets has already been
observed by Osherson, Stob and Weinstein [25, Proposition 4.6.1A] and can be
shown as follows. A learner that always outputs an index for the finite set seen so
far learns all finite sets and is indeed reliable. Next assume that M is a reliable
learner and let a_1 a_2 ... be any text for an infinite set A. Then each set of the
form {a_1, ..., a_m} is either learned by M or M diverges on any text for this set
and consequently there must be infinitely many sets of the former or infinitely
many sets of the latter type. But in both cases, by simply repeating elements in
the given text, we can construct a text for A on which M does not converge to
an index for A, that is, M does not learn A.
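The learner for the class of all finite sets mentioned in this argument can be sketched as follows. The sketch assumes canonical indices in the usual binary coding and a pause symbol written as "#", and the helper names are hypothetical. It shows the learner stabilising on a text for a finite set and changing its mind with every new element on a text for an infinite set.

```python
PAUSE = "#"  # assumed representation of the pause symbol


def canonical_index(finite_set):
    # Assumed binary coding: F_i consists of the positions of the 1-bits of i.
    return sum(1 << x for x in finite_set)


def finite_set_learner(sigma):
    """Output an index for the finite set seen so far."""
    return canonical_index({x for x in sigma if x != PAUSE})


def hypotheses(learner, text_prefix):
    """Hypotheses of the learner on all non-empty prefixes of text_prefix."""
    return [learner(text_prefix[:n]) for n in range(1, len(text_prefix) + 1)]


def mind_changes(hyps):
    """Number of positions at which consecutive hypotheses differ."""
    return sum(1 for a, b in zip(hyps, hyps[1:]) if a != b)


# Prefix of a text for the finite set {1, 4}: the learner settles on index 18.
finite_text = [4, PAUSE, 1, 4, 1, PAUSE, 4]
print(hypotheses(finite_set_learner, finite_text))  # [16, 16, 18, 18, 18, 18, 18]

# Prefix of the text 0, 1, 2, ... for the infinite set of all natural numbers:
# every new element forces a mind change.
print(mind_changes(hypotheses(finite_set_learner, list(range(10)))))  # 9
```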

Remark 3.4. Limit-refuting learnable classes do not contain infinite ascending
chains. The assertion is immediate from Remark 2.4 because by definition a
limit-refuting learner converges on all texts.

The following remark extends the observation of Mukouchi and Arikawa [23]
that any refuting learnable class is also reliably learnable.

Remark 3.5. For any class C,

    C refuting learnable  ⇒  C limit-refuting learnable
                          ⇒  C reliably learnable,                        (1)

and both implications are strict. In particular, the concepts of refuting learnable,
limit-refuting learnable and reliably learnable class are mutually distinct.

The first implication in (1) is immediate by definition, while the second one
follows by Remark 3.2. Moreover, the first two concepts are separated by the
class considered in Remark 3.9 below, while the class of all finite sets is reliably
learnable but is not limit-refuting learnable, as follows by Remarks 3.3 and 3.4.

Remark 3.6. Reliable learners have been introduced by Minicozzi [21] by
a slightly different formulation where on any text for a set the learner either
learns the set or diverges. Minicozzi’s definition is apparently less restrictive
than Definition 3.1 because the former allows that a learner fails to learn a set A
while it still converges to an index for A on some texts for A. Nevertheless, both
definitions yield the same concept of reliably learnable class. A similar statement
holds with respect to corresponding less restrictive definitions of refuting and
limit-refuting learning where for example in the case of limit-refuting learning
one just requires that on any text for a set the learner either converges to an
index for this set or converges to the refutation symbol. Proofs can be obtained
by considering stabilizing sequences as in Remark 2.5 or constructions similar
to the ones used for obtaining such sequences.

Remark 3.7. Minicozzi [21] showed that finite unions of reliably learnable
classes are also reliably learnable. Similar assertions hold for all variants of
refuting learning by the following simple principle. Given k refuting learners
M_1, M_2, ..., M_k for the classes C_1, C_2, ..., C_k, the learner M given by

    M(σ) = M_l(σ)   if l is the least index with M_l(σ) ≠ ?;
    M(σ) = ?        otherwise;

learns the union C of the classes C_1, C_2, ..., C_k with respect to the same
variant of refuting learning.

By Remarks 3.3 and 3.4 any refuting learnable class C contains only finite sets
and any infinite set A contains some finite set D ∉ C. Mukouchi and Arikawa [23]
demonstrated that in the case of unrestricted learners, the two latter properties
can be extended to a characterization of the sharply refuting learnable classes,
i.e., in our terms a class C is sharply refuting learnable by an unrestricted learner
iff there are finite sets D_0, D_1, ... such that any infinite set contains some set D_i
while none of the sets D_i is contained in any set in C. The following theorem is
essentially a reformulation of the characterization of Mukouchi and Arikawa.
Recall in connection with the theorem that by definition a class is refuting
learnable if and only if it is contained in a sharply refuting learnable class.

Theorem 3.8. [23] A class C is sharply refuting learnable iff C contains only
finite sets and there is a recursively enumerable class {D_0, D_1, ...} of finite sets
such that C coincides with the class {X : D_i ⊈ X for all i}.

Theorem 3.8 implies in particular that sharply refuting learnable classes are
closed under taking subsets.

Proof. First assume that C is sharply refuting learnable, that is, C contains
exactly the sets that are learned by some refuting learner M. Then C contains
only finite sets by Remark 3.2. Moreover, the class 𝒟 of all finite sets D such
that there is a string σ over D where M(σ) = ? is obviously recursively
enumerable and exactly the sets that are not learned by M contain some set in 𝒟.
Conversely, given a recursively enumerable class as in the theorem and a
representing function g, a refuting learner for C is obtained as follows. On input σ,
the learner checks whether F_{g(i)} ⊆ range(σ) for any i < |σ|. If so, the learner
outputs a refutation symbol while, otherwise, it outputs an index for range(σ). □

Remark 3.9. Theorem 3.8 does not extend to limit-refuting learning. A
counter-example is given by the class C of all sets of the form {0, 2, 4, ..., 2n,
2n + 1}. The class C is learned by the limit-refuting learner that outputs an
index for the finite set seen so far in case this set is in C and, otherwise, outputs a
refutation symbol. However, the set of all even numbers is infinite and any of its
finite subsets with maximum m can be extended to the set {0, 2, 4, ..., m, m + 1}
in C, i.e., there is no sequence D_0, D_1, ... as in Theorem 3.8.

The following theorem shows that also the property of limit-refuting learnable
classes shown in Remark 3.4 can be extended to a characterization.

Theorem 3.10. A class is limit-refuting learnable iff it is contained in a
recursively enumerable class of finite sets that does not contain any infinite
ascending chain.

Proof. Let 𝒞 be a class of finite sets that is recursively enumerable with
representing function g and does not contain any infinite ascending chain. Consider
the learner M that on input σ outputs an index for range(σ) in case the latter
set is among F_{g(0)}, ..., F_{g(|σ|)} and, otherwise, outputs ?. Then, obviously, M is
computable and learns every set in 𝒞. Moreover, M limit-refutes any set A ∉ 𝒞.
In case A is finite, this is immediate by the construction of M. So assume that A
is infinite and let a_1, a_2, ... be any text for A. As 𝒞 contains no infinite
ascending chain, 𝒞 contains only finitely many sets of the form {a_1, a_2, ..., a_n}
and hence M outputs a refutation symbol on almost all prefixes of the given text.

In order to prove the reverse direction, assume that we are given a class 𝒞_0
that is learned by a limit-refuting learner M. Let the class 𝒞 contain all finite
sets C such that

    for all τ ∈ (C ∪ {#})* with |τ| < |C|,
    there is η ∈ (C ∪ {#})* with M(τη) ≠ ?.                               (2)

The set of all indices i such that (2) is satisfied with C replaced by F_i is
recursively enumerable, that is, 𝒞 is a recursively enumerable class of finite sets.
Moreover, by construction, 𝒞 contains all finite sets that are learned by M, hence
𝒞_0 is contained in 𝒞. Assume now for a proof by contradiction that A_0, A_1, A_2, ...
is an infinite ascending chain that is contained in 𝒞 and let A be the union of
the sets A_i. Then by Remark 3.3 the learner M limit-refutes the infinite set
A and hence, by Remark 2.5, there is a stabilizing sequence τ for A and M
with range(τ) ⊆ A and M(τη) = ? for any string η with range(η) ⊆ A. Thus
(2) is false for all finite subsets C of A where range(τ) ⊆ C and |τ| < |C|.
Consequently, contrary to our assumption, almost all sets A_i are not in 𝒞. □
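The learner constructions used in the proofs of Theorems 3.8 and 3.10 can be sketched in a few lines. This is an illustration under several assumptions: canonical indices use the usual binary coding, the enumerations return finite sets directly rather than canonical indices, the forbidden sets D_i are taken to be all three-element sets (so the learned class consists of all sets with at most two elements), and the class for the limit-refuting learner is the one from Remark 3.9. All names are hypothetical.

```python
from itertools import count, islice

REFUTE = "?"
PAUSE = "#"


def string_range(sigma):
    return {x for x in sigma if x != PAUSE}


def canonical_index(finite_set):
    # Assumed binary coding of canonical indices.
    return sum(1 << x for x in finite_set)


def three_element_set(i):
    """D_i: the i-th three-element set, ordered by canonical index."""
    codes = (c for c in count() if bin(c).count("1") == 3)
    code = next(islice(codes, i, None))
    return {x for x in range(code.bit_length()) if (code >> x) & 1}


def refuting_learner(sigma, forbidden=three_element_set):
    """Theorem 3.8: refute once some D_i with i < |sigma| is a subset of the range."""
    seen = string_range(sigma)
    if any(forbidden(i) <= seen for i in range(len(sigma))):
        return REFUTE
    return canonical_index(seen)


def even_prefix_plus_one(i):
    """The i-th member {0, 2, ..., 2i, 2i + 1} of the class from Remark 3.9."""
    return set(range(0, 2 * i + 1, 2)) | {2 * i + 1}


def limit_refuting_learner(sigma, member=even_prefix_plus_one):
    """Theorem 3.10: output an index only if the range is among the first members."""
    seen = string_range(sigma)
    if any(member(i) == seen for i in range(len(sigma) + 1)):
        return canonical_index(seen)
    return REFUTE


print([refuting_learner([0, 1, 2, 3][:n]) for n in range(1, 5)])
# [1, 3, '?', '?']
print([limit_refuting_learner([0, 1, 2, 3][:n]) for n in range(1, 5)])
# ['?', 3, '?', '?']
print([limit_refuting_learner([0, 2, 3, PAUSE][:n]) for n in range(1, 5)])
# ['?', '?', 13, 13]
```

The second output line shows an index appearing after a refutation symbol, which is permitted for a limit-refuting learner but not for a refuting one.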

Recall that a confident learner is a learner that always converges to a natural
number, see Definition 2.3. Given any limit-refuting learner, according to
Remark 3.2 this learner can be transformed into an equivalent reliable learner.
Similarly, by replacing all refutation symbols by any fixed index, we can
transform any limit-refuting learner into a confident learner that learns the same
class, that is, any limit-refuting learnable class is also confidently learnable.
Theorem 3.11 shows that the reverse implication is true for classes that are closed
under taking subsets.

Theorem 3.11. Let the class C be closed under taking subsets. Then the
following conditions are equivalent.

(a) C is limit-refuting learnable.

(b) C is confidently learnable.

(c) C is learned by a refuting learner that may use the halting problem K as an
oracle.

Theorem 3.12. In a setting of learners that use canonical indices as
hypotheses, any class is limit-refuting learnable if and only if it is confidently
learnable.

4 Refuting Learning in a Logical Setting

In the sequel, we consider learning in a logical setting, that is, the classes to
be learned, the data, and occasionally also the hypotheses are given in terms of
logical formulae. We will always work with a logical language ℒ that consists of a
unary predicate symbol P plus a subset of the symbols {s, <, +, *, 0, 1, ...}. The
structures considered all have domain ℕ and the interpretation of the symbols
other than P is always the usual one, that is, the constant n is interpreted as the
number n, s is the successor function, < is the usual strict order on natural
numbers, and + and * are interpreted as addition and multiplication over ℕ. We
will refer to such structures as standard structures. The aim of the learning
process is then to identify the interpretation of P. The logical language ℒ will be
chosen among

ℬ = {P, 0, 1, 2, ...}, the basic language,

𝒮 = ℬ ∪ {s}, the language of successor,

𝒪 = ℬ ∪ {s, <}, the language of order,

𝒫 = ℬ ∪ {s, <, +}, the language of Presburger’s arithmetic,

𝒜 = ℬ ∪ {s, <, +, *}, the language of arithmetic.

With a language ℒ understood, the standard structure determined by a set A is
denoted by M(A). In the setting considered here, a set A and the structure M(A)
are essentially equivalent. Accordingly, we extend the notation introduced in
connection with the learning of sets to the learning of ℒ-structures. For example,
given any ℒ-structure M(A), an ℒ-text for M(A) or, for short, a text for M(A) is
simply a text for the set A and a learner learns M(A) if on every text for M(A),
the learner converges to an index for A. Moreover, given a sentence Ψ and a set
A, we write Ψ[A] for the truth value of Ψ in M(A).

A class C of standard structures is called ℒ-axiomatizable if and only if there is
an ℒ-sentence Φ such that C contains exactly those standard structures in which
the formula is true, i.e., C = {X : Φ[X]}. Next we review some well-known facts
about axiomatizable classes.

Remark 4.1. Any 𝒫-axiomatizable class of finite sets is computable. For a
proof, recall that Presburger’s arithmetic, i.e., the theory of the natural numbers
with addition, is decidable. As a consequence, given an index i and a
𝒫-formula Φ we can effectively test whether Φ is true in the standard structure
determined by the finite set F_i = {n_1, ..., n_m} by first replacing in Φ every
subformula of the form Pt, t a term, by t = n_1 ∨ ... ∨ t = n_m, then checking
whether the resulting formula is true in Presburger’s arithmetic.
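The substitution step of this argument can be sketched as a simple rewriting of formula trees. The tuple-based formula syntax below is a hypothetical representation chosen only for illustration, and the subsequent decision procedure for Presburger's arithmetic is not implemented here.

```python
def eliminate_predicate(formula, members):
    """Replace every atom ("P", t) by t = n_1 or ... or t = n_m for the given members."""
    op = formula[0]
    if op == "P":
        term = formula[1]
        equalities = [("eq", term, n) for n in sorted(members)]
        if not equalities:
            return ("false",)  # empty disjunction: P holds nowhere
        result = equalities[-1]
        for equality in reversed(equalities[:-1]):
            result = ("or", equality, result)
        return result
    if op in ("eq", "less", "false"):
        return formula  # atoms without P are left unchanged
    if op == "not":
        return ("not", eliminate_predicate(formula[1], members))
    if op in ("and", "or"):
        return (op,) + tuple(eliminate_predicate(f, members) for f in formula[1:])
    if op in ("forall", "exists"):
        return (op, formula[1], eliminate_predicate(formula[2], members))
    raise ValueError(f"unknown connective: {op!r}")


# (forall x) [not Px or x < 6], evaluated for the finite set {2, 5}
phi = ("forall", "x", ("or", ("not", ("P", "x")), ("less", "x", 6)))
print(eliminate_predicate(phi, {2, 5}))
```

The result no longer mentions P, so its truth in the standard structure depends only on the arithmetic of the natural numbers.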

Remark 4.2. For any 𝒪-axiomatizable class 𝒞, the set

    I_𝒞 = {i : (∃C ∈ 𝒞) [F_i ⊆ C]}

of canonical indices of all finite subsets of the sets in 𝒞 is computable. A proof
can be derived by an argument similar to the one used in Remark 4.1, using Büchi’s
result [2] that the monadic second order theory of the natural numbers with
successor and order is decidable.

Remark 4.3. The Theorem of Matiyasevich (see for example Smoryński [29])
states that every recursively enumerable set is Diophantine and thus can be
defined by a positive existential 𝒜-sentence. More precisely, from an index for
a recursively enumerable set W we can compute effectively a constant l and
polynomials f and g in l + 1 variables and with coefficients in ℕ such that for
all x ∈ ℕ,

    x ∈ W  iff  (∃z_1, ..., z_l ∈ ℕ) [f(x, z_1, ..., z_l) = g(x, z_1, ..., z_l)].      (3)

Hence the matrix of the right-hand side of (3) is an 𝒜-formula. Furthermore,
if W is computable, then its complement is recursively enumerable and so for
suitable polynomials f', g' as above we have for all x ∈ ℕ,

    x ∈ W  iff  (∀z_1, ..., z_l ∈ ℕ) [f'(x, z_1, ..., z_l) ≠ g'(x, z_1, ..., z_l)].    (4)

So we obtain a positive universal 𝒜-sentence that defines W because the
subformula f'(...) ≠ g'(...) is equivalent to f'(...) < g'(...) ∨ g'(...) < f'(...).

Now consider any Π_k-set A. Then A and its complement can be represented
by formulae in prenex normal form with k − 1 alternations of quantifiers and a
matrix that corresponds to a computable set W. By replacing these computable
sets according to (3) and (4) we infer that for k even and odd, the set A can be
represented by

    x ∈ A  iff  (∀y_1 ∀y_2 ... ∃y_h ∃z_1 ... ∃z_l)
                [f(x, y_1, ..., y_h, z_1, ..., z_l) = g(x, y_1, ..., y_h, z_1, ..., z_l)],

    x ∈ A  iff  (∀y_1 ∀y_2 ... ∀y_h ∀z_1 ... ∀z_l)
                [f'(x, y_1, ..., y_h, z_1, ..., z_l) ≠ g'(x, y_1, ..., y_h, z_1, ..., z_l)],

respectively, i.e., by positive 𝒜-sentences with the same number k − 1 of
alternations of quantifiers. Furthermore, these formulae can be effectively computed
from a recursive index of the set W and an appropriate representation of the
quantifier prefix for the variables y_1, ..., y_h.

Remark 4.4. Besides learners that state their hypotheses in the form of 
canonical, computable or recursively enumerable indices of sets, in the logical 
setting one can also consider learners that state their hypotheses in the form 
of logical formulae. We consider two ways of indexing sets by logical formulae, 
which might be called coinciding indices and subset indices. A formula used 
as a coinciding index is true for (and thus identifies) exactly one structure in 
the class to be learned, while for a formula used as a subset index, among all 
structures in the class to be learned that satisfy the formula there is a unique least 
structure (with respect to set theoretical inclusion), which is hence identified 
by the formula. Learning via subset indices has been considered by Martin, 
Sharma and Stephan [20]. Unless explicitly stated otherwise, the results shown
in the sequel hold no matter whether we use canonical, computable, recursively
enumerable, coinciding or subset indices, as long as all sets to be learned can be
indexed at all by such indices. Accordingly, when stating these results we do not
make explicit the indexing used.

The interplay between logic and learning has been considered before in several
papers [4,5,6,13,15,16,18,19,20,28]. In connection with the learning of standard
structures, the type of text used above is essentially equivalent to a sequence
that contains exactly the atomic ℒ-sentences that are true in the structure to
be learned, a type of data presentation considered by Shinohara [28].

Next we state for various models of learning that classes that are axiomatizable 
in an appropriate language and are learned by an unrestricted learner are in fact 
learnable. The corresponding proofs use Theorem 4.5, which extends the
characterizations of refuting and limit-refuting learnability stated in Theorems 3.8
and 3.10 to unrestricted learners.

Theorem 4.5. [23] A class C has an unrestricted sharply refuting learner iff
C contains only finite sets and there is a sequence D_0, D_1, ... of finite sets such
that C coincides with the class {X : D_i ⊈ X for all i}.

A class C has an unrestricted limit-refuting learner iff C contains only finite
sets and C does not contain any infinite ascending chain of finite sets.

Theorem 4.6. Any 𝒫-axiomatizable class that has an unrestricted
limit-refuting learner is limit-refuting learnable.

Theorem 4.7.

(a) Any 𝒪-axiomatizable class that has an unrestricted refuting learner is
refuting learnable.

(b) Any 𝒫-axiomatizable class that has an unrestricted refuting learner is
refuting learnable by a learner that may use the halting problem K as an oracle.

(c) There is a 𝒫-axiomatizable class that has an unrestricted refuting learner
but has no refuting learner.



5 Learning Axiomatizable Classes from Π_k-Texts

In this section we consider learning models where the information given in the 
data is not just a listing of all elements of the set A to be learned but in addition 
contains all formulae of a certain type that are true in M(A). Similar settings 
have been considered before, e.g., by Gasarch and Smith, Martin and
Osherson [18], Martin, Sharma and Stephan [20], and Shinohara [28].

Recall that a Π_0-ℒ-formula is an ℒ-formula without quantifiers while for all
k ≥ 1, a Π_k-ℒ-formula is an ℒ-formula that consists of a quantifier prefix followed
by a quantifier-free formula where the prefix starts with a universal quantifier
and has at most k − 1 alternations between universal and existential quantifiers
(e.g., for a quantifier-free ℒ-formula Φ, the formula (∀x_1 ∀x_2 ∃x_3) [Φ(x_1, x_2, x_3)]
is a Π_2-ℒ-formula). The concept of a Σ_k-ℒ-formula is defined almost literally the
same except that the quantifier prefix of such a formula starts with an existential
quantifier. Recall further that a Π_k-ℒ-sentence is a Π_k-ℒ-formula that does not
contain free variables and that an ℒ-formula is positive iff it does not contain
logical connectives other than ∨ and ∧.
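The level of a prenex formula in this hierarchy depends only on its quantifier prefix. The following sketch computes the least such level. It assumes that a prefix is written as a string over "A" (universal) and "E" (existential), and the function name is hypothetical. Since the definition allows at most k − 1 alternations, a formula at level k also belongs to every higher level.

```python
def classify_prefix(prefix):
    """Return ("Pi", k) or ("Sigma", k) with k least for a prefix over 'A' and 'E'."""
    if any(q not in "AE" for q in prefix):
        raise ValueError("prefix must consist of 'A' and 'E' only")
    if not prefix:
        return ("Pi", 0)  # quantifier-free formulas sit at level 0
    alternations = sum(1 for a, b in zip(prefix, prefix[1:]) if a != b)
    kind = "Pi" if prefix[0] == "A" else "Sigma"
    return (kind, alternations + 1)


print(classify_prefix("AAE"))  # ('Pi', 2), the example from the text
print(classify_prefix("EAE"))  # ('Sigma', 3)
print(classify_prefix("AAA"))  # ('Pi', 1)
```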

Definition 5.1. A Π_k-ℒ-text for a set A is a sequence that, besides pause
symbols, contains exactly all the positive Π_k-sentences that are valid in M(A).






For any set A, a text and a Π_0-ℒ-text for A provide essentially the same
information, whereas we will see below that for k > 0, in general more classes can
be learned from Π_k-ℒ-texts than just from texts.

Remark 5.2. As already observed by Martin, Sharma and Stephan [20], there
is no need to define Σ_k-texts because the amount of information provided by
a Π_k-ℒ-text and by a Σ_{k+1}-ℒ-text is exactly the same. For a proof, observe
that for any Π_k-formula Ψ(x_1, ..., x_m), in any standard structure the
Σ_{k+1}-formula (∃x_1 ... ∃x_m) [Ψ(x_1, ..., x_m)] is true if and only if for some
n_1, ..., n_m, the Π_k-formula Ψ(n_1, ..., n_m) is true.

Remarks 3.3 and 3.5 imply that for any language ℒ, just classes of finite sets
can be refuting, limit-refuting, or reliably learned from Π_0-ℒ-text. In contrast
to this, Theorem 5.3 shows that Π_2-𝒪-texts and Π_1-𝒪-texts permit limit-refuting
and reliable learning, respectively, of all countable 𝒪-axiomatizable classes or,
equivalently, of all classes that for some constant n, contain only sets that are
ultimately periodic with period n.

Theorem 5.3. 

(a) Every countable 𝒪-axiomatizable class is limit-refuting learnable from
Π_2-𝒪-texts.

(b) Every countable 𝒪-axiomatizable class is reliably learnable from Π_1-𝒪-texts.

(c) Some countable 𝒪-axiomatizable class, namely the class of all finite sets, is
not limit-refuting learnable from Π_1-𝒪-texts.

Sketch of Proof. Due to a result of Büchi [2], all members L of a given
𝒪-axiomatizable and countable class C are ultimately periodic with a fixed length n,
that is, they have a prefix δ_L of length n or more such that L(z) = L(z − n) for
all z not in the domain of δ_L. Now let Ψ_n(δ_L) denote the formula which states
that all places x with δ_L(x) = 1 satisfy Px and that furthermore, for every
y > |δ_L| − n, at least j of the n places z ∈ {y, y + 1, ..., y + n − 1} satisfy
Pz. The formula Ψ_n(δ_L) is satisfied when P is the characteristic predicate of L
but not when P is the characteristic predicate of a proper subset of L. Now the
following algorithm learns the class reliably from Π_1-𝒪-texts.

Algorithm. On input φ_0, ..., φ_h find the first formula φ_k such that φ_k = Ψ_n(δ_L)
with δ_L and L as above such that all formulas φ_i are consistent with L. If such
a formula φ_k is found, then output φ_k; else output the refutation symbol.

In case of divergence, reliable learning means here that no hypothesis is output
infinitely often. It can be shown that the algorithm converges to a refutation
symbol even in case the input text is a Π_2-𝒪-text for a set L that is not of the
form δσ^∞ for any string σ ∈ {0, 1}^n. On the other hand, one can show that every
limit-refuting learner for the class of finite sets requires Π_2-𝒪-texts, while
Π_1-𝒪-texts are not sufficient. Intuitively speaking, such a learner requires formulae
like (∀x) (∃y) [y > x ∧ Py] in order to be able to refute texts for infinite sets. □

The proof of Theorem 5.4 uses a result of Thomas [32,33], according to which
membership of a set L in an 𝒮-axiomatizable class C can be checked by counting,
up to some threshold value m, for all strings of length less than or equal to some
number n how often they appear as substring in the characteristic sequence of L.
For example, consider the 𝒮-axiomatizable class 0^+ 1^+ 0^∞ of all sets that have
a characteristic function of the form 0^i 1^j 0^∞ for some i, j > 0. For n = 2 and
m = 2 we then have

    L ∈ 0^+ 1^+ 0^∞  iff  L(0) = 0 ∧ (occ_m(L, 01) = 1 ∧ occ_m(L, 10) = 1),

where occ_m(L, η) denotes the minimum of m and the number of substrings of L
that are equal to η.
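The threshold counting in this example can be sketched for finite sets, whose characteristic sequences end in an infinite block of zeros. The sketch inspects the characteristic sequence only up to one position past the maximum of the set, which suffices here because the remaining tail consists of zeros and contributes no further occurrences of 01 or 10. The function names are hypothetical, and infinite sets are not handled.

```python
def occ(m, sequence, pattern):
    """min(m, number of possibly overlapping occurrences of pattern in sequence)."""
    hits = sum(
        1
        for start in range(len(sequence) - len(pattern) + 1)
        if sequence[start:start + len(pattern)] == pattern
    )
    return min(m, hits)


def characteristic_prefix(finite_set):
    """Characteristic sequence of a finite set up to one position past its maximum."""
    length = max(finite_set, default=-1) + 2
    return "".join("1" if x in finite_set else "0" for x in range(length))


def in_class(finite_set, m=2):
    """Membership test for the class 0^+ 1^+ 0^infinity, restricted to finite sets."""
    chi = characteristic_prefix(finite_set)
    return chi[0] == "0" and occ(m, chi, "01") == 1 and occ(m, chi, "10") == 1


print(in_class({2, 3, 4}))  # True:  001110...
print(in_class({0, 1}))     # False: the sequence starts with 1
print(in_class({1, 3}))     # False: 01 occurs twice
print(in_class(set()))      # False: 01 never occurs
```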

Theorem 5.4. Every countable 𝒮-axiomatizable class can be limit-refuting
learned from Π_1-𝒮-text.

Remark 5.5. Similarly it holds that any countable ℬ-axiomatizable class is
limit-refuting learnable from Π_1-ℬ-texts. The class {A : |ℕ \ A| ≤ 1} is countable
and ℬ-axiomatizable via the formula (∃x) (∀y) [x = y ∨ Py]. But this class is
not learnable from standard text at all. Since, for all languages ℒ considered
here, Π_0-ℒ-texts are equivalent to standard texts, it follows that results like
Theorem 5.4 cannot be improved to learnability from Π_0-ℒ-texts.

In the case of reliable and limit-refuting learning of 𝒜-axiomatizable classes,
in contrast to 𝒪-axiomatizable classes considered in the last section, for
increasing k we can learn more and more classes from Π_k-𝒜-text.

Theorem 5.6. For every k, the class of all ∅^(k)-recursive sets is not reliably
learnable from Π_k-𝒜-text but is reliably learnable from Π_{k+1}-𝒜-text. In
particular, for every k there is a countable 𝒜-axiomatizable class that is not
reliably learnable from Π_k-𝒜-texts.

Sketch of Proof. Fix k and let C be the class of all ∅^(k)-recursive sets. Recall
that a set A is in C iff A and its complement are both Σ_{k+1}-sets or, equivalently,
iff A and its complement are both Π_{k+1}-sets. Furthermore, given an index e of
an oracle Turing machine that computes A relative to oracle ∅^(k), we can compute
representations of A and its complement as Σ_{k+1}- and Π_{k+1}-sets and, by
Remark 4.3, from these representations we can compute positive Π_{k+1}-𝒜-formulae
Θ_0 and Θ_1 such that for all n,

    (n ∉ A iff Θ_0(n)[A]) and (n ∈ A iff Θ_1(n)[A])                        (5)

in case A is computed by the e-th oracle Turing machine relative to oracle ∅^(k).
It can be shown that there is a reliable learner N that learns the class C from
Π_{k+1}-𝒜-text by syntactically analyzing the data, i.e., N checks whether certain
formulae containing Θ_0 and Θ_1 have already appeared in the input text.

Non-Learnability from Π_k-𝒜-text. It remains to show that C is not reliably
learnable from Π_k-𝒜-text. In order to do so, we fix any learner M that is reliable
(in the sense that for any set Y, either M converges on any Π_k-𝒜-text for Y
to a subset index for Y or M diverges on any Π_k-𝒜-text for Y) and show that
there is a ∅^(k)-recursive set R, i.e., R ∈ C, such that M does not learn R.

Let Ψ_0, Ψ_1, ... be an enumeration of all positive Π_k-𝒜-sentences where, in
order to simplify notation, we assume that Ψ_0 is true for all sets. Recall that
these sentences are monotone in the sense that whenever A ⊆ B and Ψ_i[A] is
true, then so is Ψ_i[B]. We define inductively for all α ∈ {0, 1}* computable sets
A_α, B_α, C_α and corresponding intervals I_α = {X : A_α ⊆ X ⊆ B_α}. These
intervals are chosen such that whenever β extends α, then I_β is contained in I_α.
Moreover, we will ensure for all α,

    Ψ_i[A_α] = Ψ_i[B_α] for all i < |α|,                                    (6)

and hence, by monotonicity of the Ψ_i, all sets in I_α agree with respect to the
predicates Ψ_0 through Ψ_{|α|−1}.

The inductive definition starts with the interval I_λ bounded by A_λ = ∅ and
B_λ = ℕ. Then given A_α and B_α, in order to define A_{α0}, A_{α1}, B_{α0} and B_{α1}, we
proceed as follows.

— Let n = |α|. Let C_α be the union of A_α and the set containing every second
element in B_α \ A_α, i.e., if c_1, c_2, ... is a strictly ascending enumeration of
the elements of B_α \ A_α, then C_α = A_α ∪ {c_2, c_4, ...}.

So we have A_α ⊂ C_α ⊂ B_α and in this chain of inclusions, any set contains
infinitely many more numbers than its predecessors.

— Let

    (D_α, E_α) = (C_α, B_α)   in case Ψ_n[C_α] holds,
    (D_α, E_α) = (A_α, C_α)   otherwise.

(By monotonicity of Ψ_n, if the first case applies then Ψ_n is true for all sets
between C_α and B_α, if the second case applies then Ψ_n is false for all sets
between A_α and C_α. Hence by construction, {X : D_α ⊆ X ⊆ E_α} is an
infinite subinterval of I_α and all sets in this interval agree with respect to Ψ_n.)

— Let x_α be the first element in E_α \ D_α. Define two infinite and disjoint
subintervals I_{α0} and I_{α1} of I_α by letting A_{α0} = D_α, B_{α0} = E_α \ {x_α}, and
A_{α1} = D_α ∪ {x_α}, B_{α1} = E_α.

By construction, for all $\alpha$ the sets $A_\alpha$, $B_\alpha$ and $C_\alpha$ are computable. Furthermore,
for any given $\alpha$, the inductive definition of these sets is computable relative to
the oracle $\emptyset^{(k)}$, hence with access to this oracle we can compute programs that do
not use an oracle and decide the sets $A_\alpha$, $B_\alpha$, and $C_\alpha$. Furthermore, for any set F
there is a unique set $X_F$ that is contained in the intersection of all classes $I_\alpha$
such that $\alpha$ is a prefix of the characteristic function of F. If we let

$\Psi_n^F = \Psi_n$ if $\Psi_n[X]$ holds for all $X \in I_{F(0)F(1)\ldots F(n)}$,
$\Psi_n^F = \#$ otherwise (that is, $\neg\Psi_n[X]$ for all $X \in I_{F(0)F(1)\ldots F(n)}$),

then $\Psi_0^F, \Psi_1^F, \ldots$ is a $\Pi_k$-$\mathcal{A}$-text for $X_F$. Note that $\Psi_n^F$ corresponds to the value
of the formula $\Psi_n[C_\alpha]$ where $\alpha$ is equal to $F(0)F(1)\ldots F(n-1)$. For any string $\alpha$,
let $t_\alpha = \Psi_0^F \Psi_1^F \ldots \Psi_{|\alpha|}^F$ where F is any extension of $\alpha$ and the actual choice of F
does not affect the value of $t_\alpha$. Then $t_\alpha$ can be computed from $\alpha$ relative to $\emptyset^{(k)}$.

Given any reliable learner M, there is a set F such that F and $X_F$ are
both computable relative to oracle $\emptyset^{(k)}$, but M does not learn the set $X_F$. Given
any string $\alpha$, the learner M cannot learn all of the uncountably many sets $X_F$
with $F \succeq \alpha$ and so, as M is reliable, we can pick an extension $\beta \succeq \alpha$ with
$M(t_\alpha) \ne M(t_\beta)$. If we start from the empty string, we obtain a set F and a text
$\Psi_0^F, \Psi_1^F, \ldots$ for $X_F$ on which M changes its mind infinitely often.

By similar methods, Theorem 5.6 can be extended to the following result.

Theorem 5.7. For every k, there is a countable $\mathcal{A}$-axiomatizable class that is
limit-refuting learnable from $\Pi_{k+1}$-$\mathcal{A}$-text but not from $\Pi_k$-$\mathcal{A}$-text.

It is possible to define multiplication in the language V of Presburger's arithmetic
if the language is augmented by a predicate for the square numbers. This technique
is due to Putnam [27] and can be used to obtain results for V-axiomatizable
classes that correspond to, but are slightly weaker than, Theorems 5.6 and 5.7.

Corollary 5.8. For any k, there is a countable V -axiomatizable class that 
is limit-refuting learnable from Flk+s-V-text but is not reliably learnable from 
Flk-V-text. 

Acknowledgements. We would like to thank Thomas Wilke for very helpful discussions
about his work on $\mathcal{S}$-axiomatizable classes. Furthermore, we are grateful
to the anonymous referees of Algorithmic Learning Theory 2001 and Theoretical
Computer Science for their comments and corrections.



References 

1. Leonard M. Adleman and Manuel Blum: Inductive inference and unsolvability. The 
Journal of Symbolic Logic, 56:891-900, 1991. 

2. J. Richard Büchi: On a decision method in restricted second order arithmetic.
Proceedings of the International Congress on Logic, Methodology and Philosophy
of Science, Stanford University Press, Stanford, California, 1960.

3. Heinz-Dieter Ebbinghaus, Jörg Flum and Wolfgang Thomas: Mathematical Logic,
Springer, 1994.

4. William I. Gasarch, Mark G. Pleszkoch and Robert Solovay: Learning via queries
in [+, <]. The Journal of Symbolic Logic, 57:53-81, 1992.

5. William I. Gasarch and Carl H. Smith: Learning via queries, Journal of the Association
for Computing Machinery, 39:649-674, 1992.

6. Clark Glymour: Inductive Inference in the limit. Erkenntnis 22:23-31, 1985. 

7. E. Mark Gold: Language identification in the limit. Information and Control, 10: 
447-474, 1967. 

8. Gunter Grieser: Reflexion in der Induktiven Inferenz. Diploma Thesis, Technische 
Hochschule Leipzig, 1996. 

9. Gunter Grieser: Reflecting Inductive Inference Machines and its Improvement by 
Therapy. Seventh Annual International Workshop on Algorithmic Learning Theory 
(ALT), Lecture Notes in Artificial Intelligence 1160:325-336, Springer, 1996. 

10. Sanjay Jain: Learning with refutation. Journal of Computer and System Sciences,
57:356-365, 1998.

11. Sanjay Jain, Efim Kinber, Rolf Wiehagen, Thomas Zeugmann: On learning of
functions refutably, Theoretical Computer Science, to appear.

12. Sanjay Jain, Daniel Osherson, James Royer and Arun Sharma: Systems that Learn,
revised edition of [25]. The MIT Press, Cambridge, Massachusetts, 1999.

13. Sanjay Jain and Arun Sharma: Elementary formal systems, intrinsic complexity 
and procrastination. Information and Computation, 132:65-84, 1997. 






14. Klaus-Peter Jantke: Reflecting and self-confident inductive inference machines.
Sixth Annual International Workshop on Algorithmic Learning Theory (ALT), Lecture
Notes in Artificial Intelligence 997:282-297, Springer, 1995.

15. Kevin T. Kelly: The Logic of Reliable Inquiry. Oxford University Press, New York,
1996.

16. Kevin T. Kelly and Clark Glymour: Inductive inference from theory-laden data.
Journal of Philosophical Logic, 21:391-444, 1992.

17. Steffen Lange and Phil Watson: Machine discovery in the presence of incomplete
or ambiguous data. Joint Proceedings of the Fourth International Workshop on
Analogical and Inductive Inference (AII) and of the Fifth Workshop on Algorithmic
Learning Theory (ALT), Lecture Notes in Artificial Intelligence 872:438-452,
Springer, 1994.

18. Eric Martin and Daniel N. Osherson: Scientific discovery based on belief revision. 
The Journal of Symbolic Logic, 62:1352-1370, 1997. 

19. Eric Martin and Daniel N. Osherson: Elements of Scientific Inquiry. The MIT 
Press, Cambridge, Massachusetts, 1998. 

20. Eric Martin, Arun Sharma and Frank Stephan: Learning Power and Language Expressiveness.
Forschungsbericht Mathematische Logik 49, Mathematisches Institut,
Universität Heidelberg, Heidelberg, 2000.

21. Eliana Minicozzi: Some natural properties of strong-identification in inductive inference.
Theoretical Computer Science, 2:345-360, 1976.

22. Yasuhito Mukouchi and Setsuo Arikawa: Inductive inference machines that can 
refute hypothesis spaces. Fourth Annual International Workshop on Algorith- 
mic Learning Theory (ALT), Lecture Notes in Artificial Intelligence 744:123-136, 
Springer, 1993. 

23. Yasuhito Mukouchi and Setsuo Arikawa: Towards a mathematical theory of ma- 
chine discovery from facts. Theoretical Computer Science, 137:53-84, 1995. 

24. Piergiorgio Odifreddi: Classical Recursion Theory, Volumes I and II. North Holland 
and Elsevier, Amsterdam, 1989 and 1999. 

25. Daniel Osherson, Michael Stob and Scott Weinstein: Systems That Learn. An In- 
troduction to Learning Theory for Cognitive and Computer Scientists. Bradford — 
The MIT Press, Cambridge, Massachusetts, 1986. 

26. Mojzesz Presburger: Über die Vollständigkeit eines gewissen Systems der Arithmetik
der ganzen Zahlen in welchem die Addition als einzige Operation hervortritt.
C. R. 1er Congrès des Mathématiciens des Pays Slaves (Warsaw) 92-101,
395, 1930.

27. Hilary Putnam: Decidability and essential undecidability. The Journal of Symbolic 
Logic, 22:39-54, 1957. 

28. Takeshi Shinohara: Inductive inference of monotonic formal systems from positive 
data. New Generation Computing, 8:371-384, 1991. 

29. Craig Smoryński: Logical Number Theory I. An Introduction. Springer, 1991.

30. Robert I. Soare: Recursively Enumerable Sets and Degrees. A Study of Computable
Functions and Computably Generated Sets. Springer, 1987.

31. Wolfgang Thomas: Automata on Infinite Objects. Handbook of Theoretical Com- 
puter Science, Vol. B, edited by Jan van Leeuwen, p. 133-191, Elsevier, Amsterdam, 
1990. 

32. Wolfgang Thomas: Classifying regular events in symbolic logic, Journal of Com- 
puter and System Sciences, 25:360-376, 1982. 

33. Thomas Wilke: Locally threshold testable languages of infinite words. Tenth Annual
Symposium on Theoretical Aspects of Computer Science (STACS), Lecture
Notes in Computer Science 665:607-616, Springer, 1993.




Efficient Learning of Semi-structured Data from Queries



Hiroki Arimura¹,², Hiroshi Sakamoto¹, and Setsuo Arikawa¹



¹ Dept. of Informatics, Kyushu University,
Fukuoka 812-8581, Japan

² PRESTO, Japan Science and Technology Co., Japan
{arim, hiroshi, arikawa}@i.kyushu-u.ac.jp



Abstract. This paper studies the polynomial-time learnability of the 
classes of ordered gapped tree patterns (OGT) and ordered gapped forests 
(OGF) under the into-matching semantics in the query learning model 
of Angluin. The class OGT is a model of semi-structured database query 
languages, and a generalization of both the class of ordered/unordered 
tree pattern languages and the class of non-erasing regular pattern languages.
First, we present a polynomial time learning algorithm for μ-OGT,
the subclass of OGT without repeated tree variables, using equivalence
queries and membership queries. By extending this algorithm,
we present polynomial time learning algorithms for the class μ-OGF
of forests without repeated variables and the class OGT of trees with repeated
variables using equivalence queries and subset queries. We also give
representation-independent hardness results which indicate that both
equivalence and membership queries are necessary to learn μ-OGT.



1 Introduction 

A huge amount of electronic data has become available on the Web in the form of
HTML and XML data during this decade [121 . These heterogeneous collections of
electronic data are called semi-structured data and are modeled as ordered node-labeled
trees |1I9| . In the database and network communities, considerable attention has been
paid to data mining and information extraction methods that extract useful
information as simple patterns from these semi-structured data. However,
there are only a small number of formal studies on the complexity of learning patterns
from semi-structured data.

In this paper, we introduce the class OGT of ordered gapped tree patterns
with the into-match semantics as a formal model of recently emerging
query languages for semi-structured data mM. Then, we study its learnability
in the exact learning model of Angluin 0. An OGT is a labeled ordered tree
with tree and gap variables, where arbitrary trees and paths are substituted for
the tree and the gap variables, resp., to match the pattern tree to a data tree. In
Fig. 1, we show examples of a constant tree and an OGT. We refer to an OGT
without repeated tree variables as a μ-OGT. The class of OGT is a generalization
of the class of ordered tree patterns with the into-match semantics of Amoth,
Cull, and Tadepalli [2] and the class of regular pattern languages [mil]. We also
introduce ordered gapped forests OGF with the into-match semantics as sets of
OGTs, whose semantics is defined as the unions of tree pattern languages.

Fig. 1. An ordered tree and an ordered gapped tree pattern, where dotted lines indicate
an into-matching from the pattern to the tree.

N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 315-333, 2001.

© Springer-Verlag Berlin Heidelberg 2001

We start by analyzing the complexity of the membership problem. We show
that the membership problems for μ-OGT and thus μ-OGF are polynomial time
solvable. On the other hand, we show that the membership problem for OGT
with repeated variables is NP-complete.

As a main result, we show that the class μ-OGT, the subclass of OGT without
repeated variables, is polynomial time learnable using equivalence and membership
queries under a finite alphabet. Our algorithm LEARN-INTO-LIN-OGT
runs in polynomial time in n using exactly one equivalence queries and O(n^) 
membership queries, where n is the size of the initial counterexample. This
algorithm does not require the assumption of infinite alphabets. We also give
representation-independent hardness results which indicate that both equivalence
and membership queries are necessary to learn μ-OGT.

By extending the previous algorithm for μ-OGT, we present a polynomial
time learning algorithm LEARN-INTO-OGF for the class μ-OGF of ordered
gapped forests without repeated variables using 0{m) equivalence and 0{miT?) 
subset queries over an infinite alphabet, where m is the cardinality of a target 
forest. For this result, we develop an efficient, complete, and proper generalization
operator for μ-OGT. Finally, we show that the full class OGT of ordered
gapped tree patterns with repeated variables is polynomial time learnable from
equivalence queries and subset queries by using the partition technique proposed
by Amoth, Cull, and Tadepalli [2].

In summary, our results generalize the polynomial time learnability results of Amoth
et al. [2] and Matsumoto and Shinohara. The into-match semantics was originally
introduced in [2] for the learning of unordered trees and forests, and it
captures well the essence of pattern matching for semi-structured data query languages.
Although there are a number of studies on the learning of tree-like
patterns |17I7I8I12| , most of them are based on the onto-match semantics [3] and
are far from the learning of patterns like OGT and OGF. This work seems to be one
of the first attempts to analyze the complexity of learning ordered tree patterns
with the into-matching semantics, and hence may provide a theoretical basis
for information extraction from the Web and XML m, and may indicate the possibilities
and limitations of such tasks.

This paper is organized as follows. In Section 2, we review basic definitions of
OGT and OGF. In Section 3, we present a polynomial time learning algorithm
for μ-OGT with EQ and MQ. In Section 4, we show that neither EQ nor MQ
can be eliminated to efficiently learn any superclass of μ-OGT. In Section 5, we
present a polynomial time learning algorithm for μ-OGF with EQ and SQ. In
Section 6, we extend our algorithms in Section 3 and Section 5 to the full class
of OGT with repeated variables. In Section 7, we conclude.



2 Preliminaries 

2.1 Ordered Tree Patterns 

In this section, we first define the class OT of ordered tree patterns, and then 
the class OGT of ordered gapped tree patterns by generalizing OT. 

For a set A, |A| denotes the cardinality of A. Let A be a set of labels. An
ordered tree over A is a rooted, node-labeled acyclic graph $t = (V, E, r, \ell, c)$ with
the following properties. V is the set of nodes and $E \subseteq V \times V$ is the set of edges.
For each edge $(u, v) \in E$, we say that u is the parent of v and v is a child
of u. The node $r \in V$ is the special node without a parent and is called the root of
t. All nodes but the root of t have exactly one parent. Each node is labeled by
the labeling function $\ell : V \to A$. For each node u, its children $u_1, \ldots, u_n$ ($n \ge 0$)
are ordered from left to right and numbered consecutively by the numbering
function $c : E \to \mathbb{N}$. A node $u \in V$ is a leaf if it has no children, and an internal
node otherwise. A node is a chain node if it has exactly one child.

For an ordered tree $t = (V, E, r, \ell, c)$, we will use the notation $V_t$, $E_t$, $r_t$, $\ell_t$,
and $c_t$ to denote the associated components V, E, r, $\ell$, and c. For an ordered tree
t and a node $u \in V_t$, t/u denotes the unique subtree of t whose root is u. We
define the size of t, denoted by |t|, to be the number of nodes in t. For ordered
trees s and t, we write s = t if they are syntactically identical in their structures and labels.
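
As an illustration of these definitions, the following minimal Python sketch represents an ordered labeled tree as a nested tuple and computes the size |t| and the subtree t/u. The tuple encoding, the addressing of nodes by paths of child indices, and the helper names (node, size, subtree, is_leaf, is_chain) are assumptions made for this illustration and are not taken from the paper.

```python
# A tree is a pair (label, children), where children is a tuple of trees
# ordered from left to right. This encoding is an illustrative assumption.

def node(label, *children):
    """Build the ordered tree label(children...)."""
    return (label, children)

def size(t):
    """Number of nodes in t, written |t| in the text."""
    return 1 + sum(size(c) for c in t[1])

def subtree(t, path):
    """Subtree t/u, where the node u is addressed by a path of child
    indices from the root (the empty path addresses the root itself)."""
    for i in path:
        t = t[1][i]
    return t

def is_leaf(t):
    """A leaf has no children."""
    return len(t[1]) == 0

def is_chain(t):
    """A chain node has exactly one child."""
    return len(t[1]) == 1

# Example: the ordered tree a(b, c(d)).
t = node("a", node("b"), node("c", node("d")))
print(size(t))                      # 4
print(is_chain(subtree(t, (1,))))   # True: the node labeled c has one child
```

Because nested tuples are immutable and hashable, two trees are syntactically identical in the sense of s = t exactly when the tuples compare equal.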

Next, we define the class of ordered tree patterns. Let $\Sigma = \{a, b, f, g, \ldots\}$ be a
finite alphabet of constant symbols and $X = \{x, y, z, \ldots\}$ be a countable alphabet
of tree variables. An ordered tree pattern (OT) is an ordered tree over $\Sigma \cup X$,
where each internal node is labeled by a constant symbol and each leaf is labeled
by either a constant symbol or a tree variable. For an ordered tree pattern t, we
denote by var(t) the set of tree variables appearing in t. A node v in t is said
to be a constant node if $\ell_t(v) \in \Sigma$ and a variable node if $\ell_t(v) \in X$. An ordered
tree pattern is constant if it contains no tree variables. $\mathcal{T}$ denotes the set of all
constant trees over $\Sigma$.

We introduce two matching semantics for OT according to Amoth et al. [2].

Definition 1. For OTs s and t, an into-matching from s to t is a mapping
$\phi : V_s \to V_t$ satisfying the following conditions.

(1) The mapping $\phi$ is one-to-one.

(2) The root of s maps to the root of t, i.e., $\phi(r_s) = r_t$.

(3) The mapping $\phi$ preserves the constant labels. That is, if a constant node $u \in V_s$
maps to $v \in V_t$, then they have the same labels, i.e., $\ell_s(u) = \ell_t(v) \in \Sigma$.

(4) The mapping $\phi$ maps the variable nodes $u_1, \ldots, u_n \in V_s$ ($n \ge 2$) with the
same label $x \in X$ to nodes $v_1, \ldots, v_n \in V_t$ with the same subtrees. That is,
$t/v_1 = \cdots = t/v_n$.

(5) The mapping $\phi$ preserves the ordering among children. That is, if a constant
node $u \in V_s$ with k children maps to a node $v \in V_t$ with n children, then
$k \le n$ and there exist some $1 \le j_1 < \cdots < j_k \le n$ such that for every
$1 \le i \le k$, the i-th child of u maps to the $j_i$-th child of v.

A variant of the above definition without condition (2) is common in tree
pattern matching [ig. Another matching semantics is the onto-matching [3],
which is the matching semantics for first-order terms when OTs are restricted
to be ranked trees [3]. The onto-matching is defined as an into-matching in which
the numbers k and n of children must be the same in condition (5).

Definition 2 (Amoth et al. [2]). For OTs s and t, s into-matches (onto-matches,
resp.) t, denoted by $s \sqsupseteq_{in} t$ ($s \sqsupseteq_{on} t$, resp.), if there exists an into-matching
$\phi$ (an onto-matching $\phi$, resp.) from s to t.

Now, we define the class of ordered gapped tree patterns. Let $\Gamma$
be a countable alphabet of gap variables mutually disjoint from $\Sigma$ and X.

Definition 3. An ordered gapped tree pattern (OGT) is an ordered tree over
$\Sigma \cup X \cup \Gamma$ defined as follows. Each internal node is labeled by either a constant
symbol or a gap variable. Each leaf is labeled by either a constant symbol or a
tree variable. An internal node is labeled by a gap variable only when it is a chain
node, i.e., a node with exactly one child. This constraint intuitively means that
a gap variable is essentially an edge label that matches any path in a tree. There
are no repeated occurrences of the same gap variable in the OGT.

A node v in t is said to be a gap node if $\ell_t(v) \in \Gamma$. Constant and variable
nodes are defined similarly as in OT. For an OGT t, we denote by var(t) and
gap(t) the sets of the tree and the gap variables appearing in t, resp. We often
use the term notation for OGT. For example, the OGT in Fig. 1 can be written
as $A(*_1(B(*_2(X), B(X), T(Y))))$. Below, we give the definition of the gapped
matching for OGT. Note that the following definition applies only to gap nodes,
while condition (5) in Definition 1 applies only to constant nodes.

Definition 4. For OGTs s and t, a gapped into-matching from s to t is a mapping
$\phi : V_s \to V_t$ satisfying the following condition in addition to conditions
(1)-(5) in Definition 1 for the into-matching for OT.

(6) The mapping $\phi$ preserves the parent-child relation on gap nodes. That is, if
a gap node $u \in V_s$ maps to $v \in V_t$, then the unique child of u maps to a
proper descendant of v in t.

Definition 5. For OGTs s and t, s gapped into-matches t, denoted by $s \sqsupseteq_{in} t$,
if there exists a gapped into-matching $\phi$ from s to t.

See Fig. 1 for an example of the gapped into-matching from an OGT to a
constant tree, where dotted lines indicate the matching. Since the gapped into-matching
for OGT is a conservative extension of the into-matching for OT, we will
not distinguish the gapped into-match and the into-match in what follows.

Let s, t ∈ OGT and let $\alpha \in \{in, on\}$ be the underlying matching semantics. If
$s \sqsupseteq_\alpha t$ then s is said to be a generalization of t, and t is said to be an instance
of s. If $s \sqsupseteq_\alpha t$ and $t \sqsupseteq_\alpha s$ hold then we write $s \equiv_\alpha t$ and say that s is equivalent
to t. If $s \sqsupseteq_\alpha t$ but $t \not\sqsupseteq_\alpha s$ then we write $s \sqsupset_\alpha t$, and s is said to be a proper
generalization of t, and t is said to be a proper instance of s.

Definition 6 (| l2irfj ). For an OGT t, the language of t under the underlying
semantics $\alpha \in \{in, on\}$ is defined by the set of constant ordered trees

$L_\alpha(t) = \{ w \in \mathcal{T} \mid t \sqsupseteq_\alpha w \}$,

which consists of all instances of t with the semantics $\alpha$.

Note that for a constant tree t, $L_{in}(t)$ may be an infinite set while $L_{on}(t)$ is
always a singleton set.

Lemma 1. For any OGTs s, t and constant trees $u, w \in \mathcal{T}$, the following properties
hold.

1. The into-match relation $\sqsupseteq_{in}$ is reflexive and transitive.

2. $s \equiv_{in} t$ if and only if s and t are identical modulo renaming of variables.

3. If $s \sqsupseteq_{in} t$, then $|s| \le |t|$.

4. If $w \in L_{in}(s)$, then $|s| \le |w|$.

5. If $s \sqsupseteq_{in} t$, then $L_{in}(s) \supseteq L_{in}(t)$.

6. If $\Sigma$ is infinite, then $s \sqsupseteq_{in} t$ if and only if $L_{in}(s) \supseteq L_{in}(t)$.

An OGT is said to be linear if it does not have repeated occurrences of the
same variable. We also call a linear OGT a μ-OGT. Note that an OGT is always
linear on gap variables by definition.

We denote by OT, OGT, μ-OT, and μ-OGT the classes of OTs, OGTs, linear
OTs, and linear OGTs, resp. $\mathcal{T}$ is the class of constant ordered trees. Since we
will mainly consider the into-matching semantics for OGT, we may write $\sqsupseteq$
instead of $\sqsupseteq_{in}$ if it is clear from context. We assume that all the classes OT,
OGT, μ-OT, and μ-OGT contain the bottom tree $\bot$ defined by $L_{in}(\bot) = \emptyset$ and
$t \sqsupseteq_{in} \bot$ for every OGT t. The into-matching problem for a class C of OGTs is,
given OGTs $s, t \in C$, to determine whether $s \sqsupseteq_{in} t$ holds.

Lemma 2. The into-matching problem for μ-OGT is polynomial time solvable.

Proof. We give a sketch of the algorithm. Removing all gap nodes in pattern s,
we can decompose s into a set of small constant trees in $\mathcal{T}$. For these trees, we can
mark all points of t at which some tree into-matches in O(mn) total time by a
known tree matching technique, where m = |s| and n = |t|. Then by dynamic programming over
t, we can detect the matching points of all subtrees of s in O(mn) time. This
completes the proof. □
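
The following Python sketch shows one way to decide the into-matching problem for a linear pattern against a constant tree in polynomial time. It is a memoized recursion, not the marking and dynamic programming procedure sketched in the proof, and it makes no attempt to reach the O(mn) bound. The nested-tuple encoding and the convention that labels starting with "?" are tree variables and labels starting with "*" are gap variables are assumptions made only for this illustration.

```python
from functools import lru_cache

# Trees are pairs (label, children). As an illustrative convention (not from
# the paper), labels starting with "?" are tree variables, labels starting
# with "*" are gap variables, and all other labels are constants.

def node(label, *children):
    return (label, children)

def descendants(t):
    """Yield every proper descendant of the root of t, as a subtree."""
    for c in t[1]:
        yield c
        yield from descendants(c)

@lru_cache(maxsize=None)
def into_matches(s, t):
    """True if the linear pattern s into-matches the constant tree t,
    with the root of s mapped to the root of t."""
    label, kids = s
    if label.startswith("?"):      # a tree variable maps to any node
        return True
    if label.startswith("*"):      # gap node: its unique child must map to
        (child,) = kids            # a proper descendant, as in condition (6)
        return any(into_matches(child, d) for d in descendants(t))
    if label != t[0]:              # constant labels must agree, condition (3)
        return False
    j = 0                          # the children of s must embed, in order,
    for c in kids:                 # into the children of t, condition (5)
        while j < len(t[1]) and not into_matches(c, t[1][j]):
            j += 1
        if j == len(t[1]):
            return False
        j += 1
    return True

pattern = node("a", node("*1", node("b", node("?x"))))   # a(*1(b(?x)))
print(into_matches(pattern, node("a", node("c", node("b", node("e"))))))  # True
print(into_matches(pattern, node("a", node("b", node("e")))))             # False
```

The greedy left-to-right scan over the children is sufficient here only because the pattern is linear: the subpatterns are matched independently of each other, so taking the leftmost child of t that matches never rules out a later match. With repeated tree variables, condition (4) couples the choices and this sketch no longer applies.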



Theorem 1. The into-matching problems for OT and thus for OGT are NP-complete.

Proof. It is easy to see that the problem belongs to NP. We give a log-space reduction
from the one-in-three SAT problem to the into-matching problem for
OT as follows. Let $F = C_1 \wedge \cdots \wedge C_m$ be an instance of 3-CNF over a set of Boolean
variables $\{v_1, \ldots, v_n\}$. Suppose that the alphabet $\Sigma$ contains $f, g, h, c_1, \ldots, c_m$,
$b, 0, 1$ and $X = \{x_1, \ldots, x_n\}$. We will define the instance (T, P) of the into-matching
problem, where $T = g(T_0, T_1, \ldots, T_m)$ and $P = g(P_0, P_1, \ldots, P_m)$,
as follows. First, we define a pair {To,Pq) by Tq = /(6^^^(0, 1), . . . , 6^"^(0, 1)) 
and Po = f{b^P{xi),...,b'^'^'>{xn)). Then, we know that if Pq into-matches Tq 
then the value of {xi , . . . , Xn) corresponds with an assignment in {0, 1}". Then, 
we define (Tj,Pj) for every 1 < j < to. Let 1 < j < to be any index and 
let Cj = Li^ V V Tig be the j-th clause in the 3-CNF F, where Li is ei- 
ther Xi or Xi. For every k = 1,2,3, we denote by G {Oj 1} and 0^^^ be the 
bits that make the i^-th literal true and false, resp. Then, we define Tj = 
O 5 5 ) 1*2 ) ) ^* 2 ) )) and Pj = Cj{h{xi^,Xi 2 ,Xi^)). 

For the instance (T, P) above, it is not hard to see that $(b_1, \ldots, b_n) \in \{0, 1\}^n$ is
a yes-instance of F in one-in-three SAT if and only if P into-matches T by a
matching $\phi$ that matches each node labeled with $x_i$ to the node labeled with $b_i$
for $i = 1, \ldots, n$. This completes the proof. □



2.2 Learning Model 

As a learning model, we use the exact learning model of Angluin jS], where a
learning algorithm accesses the information on the target concept by using the
following queries. Let $t^*$ be a target hypothesis. An equivalence query for $t^*$ (EQ)
proposes any tree pattern t ∈ OGT and is denoted by EQ(t). The answer is yes
if $L(t) = L(t^*)$. Otherwise, a counterexample $w \in (L(t^*) - L(t)) \cup (L(t) - L(t^*))$
is returned. A counterexample w is positive if $w \in L(t^*)$ and negative otherwise.
A subset query (SQ), denoted by SQ(t), proposes any tree pattern t ∈ OGT
and receives the answer yes if $L(t) \subseteq L(t^*)$ and no otherwise. A membership
query (MQ), denoted by MQ(w), proposes any constant tree $w \in \mathcal{T}$ and
receives the answer yes if $w \in L(t^*)$ and no otherwise.

The goal of an exact learning algorithm A is exact identification of the target
hypothesis $t^*$ by making equivalence and membership queries for $t^*$. A must
halt and output a hypothesis $t \in \mathcal{H}$ that is equivalent to $t^*$, i.e., $L(t) = L(t^*)$,
and, at any stage in learning, the running time of A must be bounded by a
polynomial in the size of $t^*$ and of the longest counterexample returned by
equivalence queries so far.

MQ and EQ for μ-OGT are polynomial time decidable under the into-match
semantics by Lemma 2. Hence, the exact learnability with EQ (and MQ, resp.)
implies the learnability in the PAC-learning model (with MQ, resp.) and the
prediction learning model (with MQ, resp.) for μ-OGT jS]. Unfortunately, MQ
is NP-complete for the full class OGT under the into-match semantics as seen in
Theorem 1, while EQ is linear time decidable by item 2 of Lemma 1. In all learning
algorithms in this paper, the hypotheses belong to the target class.

3 Learning Linear Ordered Gapped Tree Patterns 

In this section, we present an efficient algorithm for learning μ-OGT using equivalence
and membership queries.

Definition 7. Let s, t be OGTs and $\phi$ be any into-matching from s to t. Then,
a node v in t is said to be an excess node w.r.t. $\phi$ if no node in s maps to v.
Otherwise, the node v is said to be a mapped node w.r.t. $\phi$.



Lemma 3. Let s, t be OGTs such that $s \sqsupseteq t$ and let $\phi$ be any into-matching from
s to t. If $|s| \ne |t|$ then there exists at least one excess node in t w.r.t. $\phi$.

Proof. Since $\phi : V_s \to V_t$ is a one-to-one mapping, if there is no excess node in
t w.r.t. $\phi$ then |s| = |t| must hold. This is a contradiction. □

We introduce two operations reducing the size of an OGT. Let t be an OGT
and $v \in V_t$ be a node. Let v be a leaf in t. Then, the removal of v is the operation
that removes the leaf v and its incident edge $(parent, v) \in E_t$ from t. Let v be
a chain node in t. Then, the contraction at v is the operation that removes v
from t and replaces the pair of incident edges $(parent, v), (v, child) \in E_t$ with
the new edge (parent, child) in $E_t$. In both cases, we denote by $t \setminus \{v\}$ the tree
obtained from t by applying the operation at v. Let $v \in V_t$ be any node
and $a \in \Sigma \cup X \cup \Gamma$ be any symbol. The replacement at v with a is the operation
that replaces the label $\ell_t(v)$ with a. We denote the resulting tree by t[a/v].

Lemma 4. Let s, t be μ-OGTs ($t \in \mathcal{T}$) such that $s \sqsupseteq t$. If there exists either
a leaf or a chain node e that is an excess node in t w.r.t. some into-matching $\phi$
from s to t, then the constant tree t' defined as follows satisfies $s \sqsupseteq t'$.

1. For a leaf e, $t' = t \setminus \{e\}$ is obtained from t by the removal of e.

2. For a chain node e, $t' = t \setminus \{e\}$ is obtained from t by the contraction at e.

Proof. Suppose that $\phi : V_s \to V_t$ is any into-matching from s to t w.r.t. which e is
an excess node. Let $t' = t \setminus \{e\}$ be the tree obtained by either the removal of e or
the contraction at e. We show that $\phi$ is still an into-matching from s to t' as follows.
The mapping $\phi$ satisfies conditions (1) to (5) in Definition 1 and condition (6) in
Definition 4 w.r.t. t. Since e is an excess node, i.e., $e \notin \phi(V_s)$, conditions (1), (2),
and (3) are obviously satisfied w.r.t. t'. If s is a μ-OGT, then condition (4) is also satisfied.
In the case that e is a leaf, the removal of e does not change the order of the
siblings of e at all. Thus, condition (5) follows. In the case that e is a chain
node, e is on the path between $\phi(g)$ and $\phi(c)$, where g is a gap node and c is its
child in s. Thus, even after the contraction at e, $\phi(c)$ is still a proper descendant
of $\phi(g)$, and this satisfies condition (6) in Definition 4. □

Algorithm LEARN-INTO-LIN-OGT

Given: Oracles for EQ and MQ for the target μ-OGT $t^*$.

Output: A μ-OGT t equivalent to $t^*$.

Method:

1 If EQ($\bot$) = yes then return $\bot$.

2 Let w be a counterexample returned by EQ. /* A positive example of $t^*$ */

3 While t changes, do:

/* The following steps properly reduce the size of t = w */

(a) If MQ($w \setminus \{v\}$) = yes for some leaf or chain node v of w then $w := w \setminus \{v\}$,
where $w \setminus \{v\}$ is the constant tree obtained by either the removal or the
contraction, resp.

(b) t := w.

4 While t changes, do:

/* The following steps properly reduce the number of constants in t */

(a) If MQ(w[b/v]) = yes for some leaf v of w and a constant label $b \ne \ell_w(v)$,
then let $x \in X \setminus var(t)$ be a new tree variable.

(b) Else if MQ(w[b/v]) = yes for some chain node v of w and a constant label
$b \ne \ell_w(v)$, then let $x \in \Gamma \setminus gap(t)$ be a new gap variable.

(c) t := t[x/v].

5 return t;

Fig. 2. A learning algorithm for μ-OGT with the into-match semantics using equivalence
queries and membership queries

For a pair s, t of isomorphic OGTs, a pair of nodes $(u, v) \in V_s \times V_t$ has the
same position if they have the same number in the preorder numbering of
the depth-first search.

Lemma 5. Let s be an OGT and t be an instance of s of the same size. Let
$\psi : V_t \to V_s$ be an isomorphism between $V_t$ and $V_s$. For every node $v \in V_t$, if
$s \sqsupseteq_{in} t[a_i/v]$ (i = 1, 2) for a pair of distinct constants $a_1 \ne a_2$, then $\psi(v)$ is a
tree variable if v is a leaf and a gap variable if v is a gap node.

Proof. For every $i \in \{1, 2\}$, let $t_i = t[a_i/v]$ and let $\phi_i$ be an into-matching from s
to $t[a_i/v]$. Since $t_1$ and $t_2$ differ only in their labels, we have $V_{t_1} = V_{t_2} = V_t$. If
|s| = |t| then we know that $\phi_1$ and $\phi_2$ are the same isomorphism from $V_s$ to $V_t$.
From condition (3) of Definition 1, the node $u = \psi(v)$ cannot be a constant node.
Thus, u is either a variable node or a gap node depending on whether it is a leaf
or a chain node. □

Theorem 2. The algorithm LEARN-INTO-LIN-OGT of Fig. 2 exactly learns
any linear ordered gapped tree pattern in μ-OGT with the into-match semantics
in polynomial time using exactly one EQ and O(n^) MQ, where n is the size
of the initial counterexample given by EQ. This holds for both finite and
infinite alphabets $\Sigma$.

Proof. If $t^* = \bot$ then the algorithm outputs $t = \bot$, and we are done. Thus,
suppose that $t^* \ne \bot$. Then, the algorithm receives a positive counterexample
$w \in \mathcal{T}$ of $t^*$ of size $n \ge 1$ such that $t^* \sqsupseteq_{in} w$. First, we show the termination of the
algorithm. We can easily see that whenever the first while-loop is executed, the
size |w| decreases by at least one, and whenever the second while-loop is executed,
the number |var(t)| + |gap(t)| increases by at least one while |t| is constant. Thus,
the first and the second while-loops can be executed at most O(n) times.

Suppose that we enter the while-loop of step 3. The while-loop is executed
while $|w| = |t| \ge |t^*|$. If $|w| > |t^*|$ then it follows from Lemma 3 that there exists
at least one excess node, say v, w.r.t. $\phi$. If v is either a leaf or a chain node, we
are done. Otherwise, v is an internal node with more than one child. We can follow
an edge down from v and then repeat this process until we eventually reach a leaf or
a chain excess node. Thus Lemma 4 shows that step 3(a) is executed and this
repeats until $|w| = |t^*|$ holds.

Now, the algorithm enters the second while-loop. Suppose that $t^* \sqsupseteq_{in} w$
but $t^* \not\equiv_{in} w$. Since $|w| = |t^*|$, there is a node v such that $\ell_w(v)$ is constant but
$\ell_{t^*}(v)$ is either a tree variable or a gap variable. Thus by Lemma 5, the algorithm
executes one of steps 4(a) or 4(b) and correctly computes the updated pattern
t' = t[x/v] such that $t^* \sqsupseteq_{in} t'$ still holds. Therefore, this step is executed until
the algorithm reaches $t \equiv_{in} t^*$. Hence, the theorem is proved. □

Corollary 1. The class μ-OGT under the into-matching semantics is polynomial
time learnable in the exact learning model using one positive example and MQ.

4 Necessity of Equivalence and Membership Queries

In the following, we show that membership queries alone are insufficient for
learning μ-OGT. We then show that μ-OGT is prediction-preserving hard for DNF
formulas. Since the latter hardness result is representation independent and the
membership queries for μ-OGT are polynomial time solvable, it also indicates
the insufficiency of exact learning with equivalence queries alone or PAC-learning
from random examples alone, assuming the hardness of DNF formulas.

Theorem 3. Any algorithm that exactly learns all linear ordered gapped tree
patterns in μ-OGT using membership queries alone must run in $2^{\Omega(n)}$ time in
the worst case, where n is the size of the target hypothesis.

Proof. Suppose that Σ contains the four symbols 0, 1, a and f. For any binary
sequence $b_1 \cdots b_n \in \{0,1\}^n$ of length n, we have an associated constant ordered
tree pattern

$$f(b_1, \cdots f(b_n, a) \cdots)$$

over Σ. There are $2^n$ such OT. Then, an adversary maintains a set S of candidate
hypotheses as follows. Initially, S is the set of all $2^n$ chain trees defined above.
Given a membership query MQ(w) with $w \in T$, let Path(w) be the set of all
chain trees contained in w as a path in w. Whenever $S \setminus Path(w)$ is not empty, the
adversary returns "no" and sets $S := S \setminus Path(w)$. Since $\#Path(w)$ is bounded
by the size of w, the adversary can continue this process while the total size
$\sum_i |w_i|$ of the queries $w_i$ does not exceed $2^n$. Since the cost of a query of
length $m \geq 0$ is $O(m)$, the time complexity of the algorithm is bounded below
by $2^{\Omega(n)}$. □
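The adversary argument can be illustrated with a small sketch. It is only a toy model under stated assumptions: each chain tree is abbreviated to its bit string $b_1 \cdots b_n$, and each query w is represented directly by the set Path(w) of chain trees it contains. The function name `adversary` and this encoding are hypothetical and not part of the paper.

```python
from itertools import product

def adversary(n, queries):
    """Answer "no" to every query while at least one candidate survives.

    `queries` is a sequence of sets; each set plays the role of Path(w),
    i.e. the chain trees (encoded as bit strings) contained in a query w.
    Returns the number of "no" answers and the surviving candidate set S.
    """
    candidates = {"".join(bits) for bits in product("01", repeat=n)}
    answered_no = 0
    for path_set in queries:
        remaining = candidates - path_set
        if not remaining:
            break  # the adversary can no longer consistently answer "no"
        candidates = remaining
        answered_no += 1
    return answered_no, candidates

# A learner that asks about one chain tree per query is forced through
# 2^n - 1 negative answers before a single candidate is left.
n = 3
one_chain_per_query = [{"".join(bits)} for bits in product("01", repeat=n)]
count, survivors = adversary(n, one_chain_per_query)
```

Because each query removes at most #Path(w) candidates, packing more chains into one query only helps in proportion to the size of the query, which is the point of the lower bound.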

Theorem 4. The polynomial time prediction of the class μ-OGT is at least as
hard as the polynomial time prediction of DNF formulas.

Proof. It is sufficient to show that there is a prediction preserving reduction
[21] from $DNF_n$ to μ-OGT for every $n \geq 0$. Let $d = T_1 \vee \cdots \vee T_m$ be a DNF
formula over the set $\{v_1, \ldots, v_n\}$ of Boolean variables. Suppose that Σ contains
constant symbols 0, 1, f and g. For each assignment $a = a_1 \cdots a_n \in \{0,1\}^n$, let
$\tilde{a} = g(g(a_1), \ldots, g(a_n))$ and $\tilde{b} = g(g(0), g(1), \ldots, g(0), g(1)) = g((g(0), g(1))^n)$.
Then, construct an instance mapping $\psi$ and a hypothesis mapping $\xi$ as follows:

$$\psi(a) = f(\underbrace{\tilde{a}, \tilde{b}, \ldots, \tilde{b}, \tilde{a}}_{2m-1})$$

$$\xi(d) = f(\xi(T_1), \ldots, \xi(T_m)),$$

where $\psi(a)$ contains m copies of $\tilde{a}$ and (m − 1) copies of $\tilde{b}$, as required by the
pigeon-hole argument below. For each $1 \leq j \leq m$,
the subtree $\xi(T_j)$ of $\xi(d)$ is defined by $\xi(T_j) = g(\alpha^j_1, \ldots, \alpha^j_n)$, where $\alpha^j_i = g(1)$
if the j-th term $T_j$ contains $v_i$, $\alpha^j_i = g(0)$ if the j-th term $T_j$ contains $\bar{v}_i$, and
$\alpha^j_i = g$ otherwise. Then, the following statements hold. (a) $\xi(T_j)$ into-matches $\tilde{a}$
iff a satisfies term $T_j$. (b) $\xi(T_j)$ always matches $\tilde{b}$. (c) $\tilde{a}$ and $\tilde{b}$ are the only subtrees
that $\xi(T_j)$ into-matches. It easily follows from (a) and (b) that if a satisfies at
least one term, say $T_j$, in d then $\xi(d)$ into-matches $\psi(a)$. On the other hand,
assume that $\xi(d)$ into-matches $\psi(a)$. Then using (a)-(c) and the pigeon-hole
principle, it is not difficult to show that at least one $\xi(T_j)$ into-matches a copy
of $\tilde{a}$. Therefore, a satisfies $T_j$ and thus the whole formula d. Combining the above
arguments, we have that a satisfies d iff $\xi(d)$ into-matches $\psi(a)$ and the theorem
follows from [21]. □
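The reduction can be checked on small formulas with the following sketch. It rests on several assumptions that are not taken from the paper: trees are variable-free nested tuples, into-matching is simplified to "labels agree and the children of the pattern embed injectively and in order into the children of the tree", the instance $\psi(a)$ alternates m copies of $\tilde{a}$ with m − 1 copies of $\tilde{b}$, and a term is encoded as a hypothetical dictionary from variable index to the required truth value.

```python
from itertools import product

def node(label, *children):
    """An ordered tree as a (label, children) pair."""
    return (label, tuple(children))

def into_matches(pattern, tree):
    """Simplified into-matching for variable-free ordered trees (assumption)."""
    if pattern[0] != tree[0]:
        return False
    i = 0
    for child in pattern[1]:
        # Greedily take the leftmost remaining child of the tree that fits.
        while i < len(tree[1]) and not into_matches(child, tree[1][i]):
            i += 1
        if i == len(tree[1]):
            return False
        i += 1
    return True

def g(bit):
    return node("g", node(str(bit)))

def psi(a, m):
    """Instance mapping: a~, b~, a~, ..., b~, a~ (2m - 1 subtrees)."""
    a_tilde = node("g", *[g(bit) for bit in a])
    b_tilde = node("g", *[g(bit) for _ in a for bit in (0, 1)])
    children = [a_tilde]
    for _ in range(m - 1):
        children += [b_tilde, a_tilde]
    return node("f", *children)

def xi(terms, n):
    """Hypothesis mapping: one subtree per term of the DNF formula."""
    def xi_term(term):
        return node("g", *[g(int(term[i])) if i in term else node("g")
                           for i in range(n)])
    return node("f", *[xi_term(term) for term in terms])

def satisfies(a, terms):
    return any(all(a[i] == int(value) for i, value in term.items())
               for term in terms)

# d = (v1 and v2) or (not v1 and v3) or (not v2 and not v3), indices from 0
d = [{0: True, 1: True}, {0: False, 2: True}, {1: False, 2: False}]
agreement = all(
    satisfies(a, d) == into_matches(xi(d, 3), psi(a, len(d)))
    for a in product((0, 1), repeat=3)
)
```

With these conventions `agreement` records whether "a satisfies d" and "$\xi(d)$ into-matches $\psi(a)$" coincide on every assignment of the example formula.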

5 Extension for Linear Ordered Gapped Forests

In this section, we show a polynomial time learning algorithm for μ-OGF using
equivalence and subset queries. An ordered gapped forest (OGF, for short) is a
finite set $H = \{t_1, \ldots, t_k\}$ ($k \geq 0$) of ordered gapped tree patterns in OGT. An
ordered gapped forest H is linear (linear OGF or μ-OGF, for short) if every
member $t \in H$ is a linear OGT in μ-OGT. We denote by OGF and μ-OGF
the classes of OGFs and μ-OGFs, respectively. Forests of OTs with the
onto-semantics under a ranked alphabet are called unions of tree patterns in 0.
Definition 8. For an ordered gapped forest H, we define the language of H with
the into-semantics as the union $L_{in}(H) = \bigcup_{t \in H} L_{in}(t)$ of the languages defined
by its members.

The following property is called the compactness mm and plays an im- 
portant role in the learning of unions of languages |8] . 

Lemma 6. Let Σ be an infinite alphabet. For any ordered gapped forests $P, Q \in$
OGF, $L_{in}(P) \supseteq L_{in}(Q)$ if and only if for every $q \in Q$ there exists some $p \in P$
such that $p \sqsupseteq_{in} q$.

Proof. For every OGT $q \in Q$, let $w_q$ be a constant tree obtained from q by
substituting mutually distinct constants not appearing in any of P for all tree
and gap variables in q. Since $w_q \in L_{in}(Q)$, if $L_{in}(P) \supseteq L_{in}(Q)$ then there exists
some $p \in P$ such that $w_q \in L_{in}(p)$. Since none of the substituted symbols appear
in P, if $p \sqsupseteq_{in} w_q$ then we have $p \sqsupseteq_{in} q$ by inverting the substitution. □

An intuitive idea behind our learning algorithm for OGF is to identify each
member of the target forest $H_*$, one by one, using as a subroutine a learning
algorithm with MQ or SQ for learning μ-OGT. For this purpose, a property
which will be shown in Lemma 8 is essential. Below, we develop a new algorithm
LEARN-INTO-LIN-OGT-SQ as such a subprocedure.

A generalization operator for OGT is a binary relation $\gamma \subseteq$ OGT × OGT such
that for every $s, t \in$ OGT, $(s, t) \in \gamma$ implies $s \sqsubseteq_{in} t$. For $\gamma$ and any OGT s, we
define $\gamma(s) = \{t \in \mathrm{OGT} \mid (s, t) \in \gamma\}$ and denote by $\gamma^+$ the transitive closure
of $\gamma$. Then, $\gamma$ is called efficient if for every $s \in$ OGT, $\#\gamma(s)$ is polynomially
bounded by $|s|$ and all elements in $\gamma(s)$ can be printed in polynomial time in $|s|$;
$\gamma$ is called proper if for every $s, t \in$ OGT, $(s, t) \in \gamma$ implies $s \sqsubset_{in} t$, and complete
if for every $s, t \in$ OGT, $s \sqsubset_{in} t$ implies $(s, t) \in \gamma^+$.

Below, we give a generalization operator for μ-OGT.

Definition 9. Let s, t be μ-OGT. Then, $(s, t) \in \gamma_1$ holds if t is obtained from s
by applying one of the following operations O1-O4:

(O1) If s has a leaf $v \in V_s$, then $t = s \setminus \{v\}$ is the tree obtained by removing v.

(O2) If s has a gap node $v \in V_s$, then $t = s \setminus \{p\}$ is the tree obtained by
contracting s at either the non-root parent or the non-leaf child p of v.

(O3) If s has a leaf node $v \in V_s$ with constant label c, then $t = s[x/v]$, where
$x \in X \setminus var(s)$ is a new tree variable.

(O4) If s has a chain node $v \in V_s$ with constant label c, then $t = s[x/v]$, where
$x \in \Gamma \setminus gap(s)$ is a new gap variable.

Definition 10. The operator $\gamma_1$ is an efficient and proper generalization
operator for μ-OGT.

Proof. First, we see that $\gamma_1$ is proper. Let s, t be any OGT. If $(s, t) \in \gamma_1$ then
we have $s \sqsubseteq_{in} t$ by the construction of $\gamma_1$. Assume that the converse $t \sqsubseteq_{in} s$
holds. Then, we have $s \equiv_{in} t$ and it follows from (2) of Lemma 1 that s and t are
Algorithm LEARN-INTO-LIN-OGT-BY-SQ
Given: Oracles for EQ and SQ for the target μ-OGT $t_*$.
Output: A μ-OGT t equivalent to $t_*$.
Method:
1 If (EQ(⊥) = yes) then return ⊥.
2 Let w be a counterexample returned by EQ. Set t = w.
  /* w is a positive example of $t_*$ */
3 While SQ(t') = yes for some $t' \in \gamma_1(t)$, set t = t'.
  /* $\gamma_1$ is the generalization operator for μ-OGT defined in Definition 9 */
4 return t;

Fig. 3. A learning algorithm for μ-OGT with the into-match semantics using
equivalence queries and subset queries
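The control flow of the algorithm in Fig. 3 can be sketched generically as follows. The sketch assumes that the oracles and the generalization operator are supplied as callables; the names `learn_by_generalization`, `toy_generalizations` and `toy_subsumes` are hypothetical. The toy hypothesis space (fixed-length strings with a wildcard `*`) merely stands in for μ-OGT so that the loop can be run; it is not the tree-pattern class of the paper.

```python
def learn_by_generalization(eq, sq, generalizations, bottom=None):
    """Greedy loop: one EQ, then generalize while a subset query says yes.

    eq(h)              -> None if h is correct, else a positive counterexample
    sq(h)              -> True iff the language of h is contained in the target's
    generalizations(h) -> iterable of one-step generalizations of h
    """
    counterexample = eq(bottom)
    if counterexample is None:
        return bottom
    t = counterexample
    changed = True
    while changed:
        changed = False
        for candidate in generalizations(t):
            if sq(candidate):
                t, changed = candidate, True
                break
    return t

WILD = "*"

def toy_generalizations(pattern):
    """Replace one constant position by the wildcard (toy analogue of O3)."""
    for i, symbol in enumerate(pattern):
        if symbol != WILD:
            yield pattern[:i] + WILD + pattern[i + 1:]

def toy_subsumes(general, specific):
    return len(general) == len(specific) and all(
        g == WILD or g == s for g, s in zip(general, specific))

TARGET = "a*c"
learned = learn_by_generalization(
    eq=lambda h: None if h == TARGET else "abc",
    sq=lambda candidate: toy_subsumes(TARGET, candidate),
    generalizations=toy_generalizations,
)
```

Each successful subset query strictly generalizes the hypothesis, so the loop stops after a number of rounds bounded by the size of the counterexample, mirroring the termination argument given for the size complexity.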



identical modulo renaming. However, this is not possible by the construction of
$\gamma_1$. Hence, $\gamma_1$ is proper. We can see that for every s, $\#\gamma_1(s) = O(n)$ and $\gamma_1(s)$
is polynomial time computable in $n = |s|$. □



Lemma 7. The generalization operator $\gamma_1$ is complete for μ-OGT.

Proof. It is sufficient to show that for any μ-OGT s, t, if $s \sqsubseteq_{in} t$ then there is
a sequence $s_0 = s, s_1, \ldots, s_n = t$ ($n \geq 0$) of μ-OGT where for every $1 \leq i \leq n$,
$(s_{i-1}, s_i) \in \gamma_1$ holds. If $s \equiv_{in} t$ then the claim trivially holds. Thus, we assume
that $s \sqsubset_{in} t$ with an into-matching $\phi$. Then, we see that one of the operations
(O1)-(O4) is applicable to obtain an $s' \in \gamma_1(s)$ such that $s' \sqsubseteq_{in} t$ as follows. In
the case that $|s| > |t|$, it follows from Lemma 0 that there are excess nodes e
w.r.t. $\phi$ that are a leaf node or a chain node next to a gap node, and we can
apply operations (O1) or (O2) to e. In the case that $|s| = |t|$, there exists some
constant node e whose label can be changed to either a tree or a gap variable by
(O3) or (O4). In either case, we obtain $s' \in \gamma_1(s)$ such that $s' \sqsubseteq_{in} t$. □

In Fig. 3 we present a modified algorithm LEARN-INTO-OGT-SQ for learning
μ-OGT with EQ and SQ. To see the correctness of the algorithm, we prepare
some notations and lemmas. Let $t_* \in$ μ-OGT be the target OGT, $t_0 \in T$ be the
initial positive instance returned by EQ, and for every $n \geq 0$, let $t_n$ be the
hypothesis generated in the n-th execution of the while loop. Since $\gamma_1$ is proper
and complete, we have the next lemma crucial for the main result.

Lemma 8. For any $w \in T$, the sequence of hypotheses generated by the algorithm
LEARN-INTO-OGT-SQ forms a properly increasing sequence
$t_0 = w \sqsubset_{in} t_1 \sqsubset_{in} \cdots \sqsubset_{in} t_i \sqsubset_{in} \cdots \sqsubseteq_{in} t_*$ ($i \geq 0$).

To see the termination, we define the size complexity of an OGT t by
$size(t) = 2 \times |t| - (\#var(t) + \#gap(t))$. Obviously, $0 < size(t) \leq 2|t|$ holds.
The next lemma holds for the full class of OGT.

Lemma 9. For any OGT s, t, if $s \sqsubset_{in} t$ then $size(s) > size(t)$ holds.
Algorithm LEARN-INTO-LIN-OGF
Given: Oracles for EQ and SQ for the target forest $H_*$.
Output: An ordered gapped forest H equivalent to $H_*$.
Method:
1 H := ∅;
2 While (EQ(H) = no), do:
  (a) Let w be a counterexample returned by EQ.
  (b) Run the algorithm LEARN-INTO-OGT-SQ of Fig. 3 with w as the initial
      positive counterexample and using SQ for $H_*$.
  (c) Let t be the tree pattern returned by LEARN-INTO-OGT-SQ.
  (d) H := H ∪ {t}.
3 Return H.

Fig. 4. A learning algorithm for μ-OGF with the into-match semantics using
equivalence queries and subset queries
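The outer loop of Fig. 4 can be sketched in the same toy setting. All of the following are assumptions made only for illustration: patterns are fixed-length strings over the small alphabet `abc` with a wildcard `*`, languages are enumerated explicitly so that EQ and SQ can be simulated, and the function names are hypothetical. Note that the paper's guarantee needs an infinite alphabet (Lemma 6); the finite toy alphabet is chosen so that the example happens to behave well.

```python
from itertools import product

ALPHABET = "abc"  # toy finite alphabet (assumption)
WILD = "*"

def language(pattern):
    choices = [ALPHABET if symbol == WILD else symbol for symbol in pattern]
    return {"".join(word) for word in product(*choices)}

def forest_language(forest):
    words = set()
    for pattern in forest:
        words |= language(pattern)
    return words

def generalizations(pattern):
    for i, symbol in enumerate(pattern):
        if symbol != WILD:
            yield pattern[:i] + WILD + pattern[i + 1:]

def learn_tree(w, sq):
    """Inner subroutine: generalize w while a subset query succeeds."""
    t, changed = w, True
    while changed:
        changed = False
        for candidate in generalizations(t):
            if sq(candidate):
                t, changed = candidate, True
                break
    return t

def learn_forest(target):
    """Outer loop: each EQ counterexample seeds one new forest member."""
    target_words = forest_language(target)

    def sq(pattern):  # simulated subset query against the target forest
        return language(pattern) <= target_words

    def eq(hypothesis):  # simulated EQ returning a positive counterexample
        missing = sorted(target_words - forest_language(hypothesis))
        return missing[0] if missing else None

    hypothesis = []
    while (w := eq(hypothesis)) is not None:
        hypothesis.append(learn_tree(w, sq))
    return hypothesis

learned_forest = learn_forest(["a*", "*b"])
```

Since every hypothesis member passes a subset query against the target, the hypothesis forest never overshoots, so the simulated EQ only ever needs to return positive counterexamples.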



Lemma 10. The algorithm LEARN-INTO-LIN-OGT-SQ of Fig. 3 exactly
learns any linear ordered gapped tree pattern $t_*$ in μ-OGT with the into-match
semantics in polynomial time in n using exactly one EQ and O(n^) SQ, where n
is the size of the initial counterexample given by EQ. This also holds for either
finite or infinite alphabet Σ.

Proof. By similar arguments as in Theorem 2. □

In Fig. 4 we present the main algorithm LEARN-INTO-LIN-OGF for learning
OGF with EQ and SQ, which uses LEARN-INTO-OGT-SQ as a subroutine.

Theorem 5. Let Σ be an infinite alphabet. The algorithm LEARN-INTO-LIN-OGF
of Fig. 4 exactly learns any μ-OGF $H_*$ with the into-match semantics in
polynomial time in m and n using O(m) EQ and O(mn^) SQ, where $m = \#H_*$
and n is the maximum size of the counterexamples given by EQ.

Proof. In the initial stage, the algorithm receives a positive instance of $H_*$ since
$L_{in}(\emptyset) = \emptyset$. By induction on the number of stages, we can show that the example
given to the subroutine LEARN-INTO-OGT-SQ is always positive. We denote
by H the current forest in the main algorithm. Let $t_0$ and t be the initial example
and the hypothesis maintained by the subroutine. We will show that $h_* \sqsupseteq_{in} t$ for
some $h_* \in H_*$ but $h \not\sqsupseteq_{in} t$ for any $h \in H$ hold, which ensures that the subroutine
can correctly simulate SQ for $L_{in}(H_*) \setminus L_{in}(H)$ using SQ for $L_{in}(H_*)$. Suppose
to the contrary that $h \sqsupseteq_{in} t$ for some $h \in H$. Since $t \sqsupseteq_{in} t_0$ by Lemma 8, this means
$h \sqsupseteq_{in} t_0$ and thus $t_0 \in L_{in}(H)$. This contradicts the assumption, and we have shown
the claim. For any counterexample $t_0$ to EQ(H), $t_0 \in L_{in}(H_*) \setminus L_{in}(H)$, and thus
there is some $t_* \in H_*$ such that $t_* \sqsupseteq_{in} t_0$ with some matching $\phi$. Based on the
existence of $\phi$, the subroutine eventually identifies one of the $t_*$ such that $t_* \sqsupseteq_{in} t_0$.
Since the main algorithm identifies at least one member of the target forest $H_*$
in each execution of its while loop, the while loop of the main algorithm can be
executed at most m times. This completes the proof. □
6 Learning Unrestricted Ordered Gapped Tree Patterns

In this section, we present a modified algorithm for learning OGT with
repeated variables in polynomial time using equivalence and subset queries based
on the technique of [2]. Since the membership query for OGT is NP-complete
by Lemma H] this result may not have practical value. We include this section
to indicate how to learn complex tree patterns in general.

From the next lemma, we know that it is required to simultaneously replace
several subtrees to learn OGT with repeated variables. Let $\gamma_1$ be the
generalization operator for μ-OGT introduced in Section 5.

Lemma 11. Let s, t be OGT. Suppose that $s \sqsupseteq_{in} t$ but $s \not\equiv_{in} t$. Then, one of the
following conditions holds:

1. There exists some OGT $t' \in \gamma_1(t)$ such that $s \sqsupseteq_{in} t'$.

2. There exists some OGT t' such that $s \sqsupseteq_{in} t'$ and $t' \sqsupset_{in} t$ defined as follows:
For some set of at least two nodes $V = \{v_1, \ldots, v_n\}$ ($n \geq 2$) with any
labels which have the identical subtrees $t/v_1 = \cdots = t/v_n$, the OGT
$t' = t[z/v_1, \ldots, z/v_n]$ is obtained by replacing each $t/v_i$ ($1 \leq i \leq n$) with the
nodes labeled with the copies of a new variable $z \in X \setminus var(t)$.

Proof. From the definition of the into-matching in Definition 1 and the
completeness of the generalization operator $\gamma_1$ for μ-OGT in Lemma 7. □

To overcome this problem, Amoth, Cull, and Tadepalli [2] developed an
elegant partitioning technique to generalize repeated variables in unordered trees
and forests. For OGT s, t, we denote by $O_t(s)$ the set of all occurrences of s in
t, i.e., all nodes $v \in V_t$ such that s is the subtree of t with the root v, and by $t \setminus (t/v)$
the tree obtained by removing the subtree of t rooted at node $v \in V_t$. In Fig. 5
we show a version of their algorithm, Partition, modified for OGT.

Example 1. Let us explain the algorithm Partition with an example in Fig. 6.
Let $t_* = f(a(x, y), b(x, y, y))$ be an OGT and $t = f(a(a, a), b(a, a, a))$ be an
instance of $t_*$. Assume that $\gamma_1$ is already applied to t and not applicable any
more. In general, a may be either a constant or a tree. (1) First, the algorithm
inserts the copies of a new variable z at the right of each occurrence of constant
a. (2) The algorithm first deletes an old leaf a and (3) then a new leaf z using SQ.
If these steps succeed, then the matching from $t_*$ to t has to split the variables
into two groups as shown in (3) since we have more subtrees a and x than any
variable x or y. Finally, the resulting nodes z and x indicate this splitting.



Lemma 12. Let $t_*$ be a target OGT and t be any OGT such that $t_* \sqsupseteq_{in} t$.
Suppose that $s \sqsupseteq_{in} t$ but $s \not\equiv_{in} t$, and there is no OGT $t' \in \gamma_1(t)$ such that
$s \sqsupseteq_{in} t'$. Then, the algorithm Partition in Fig. 5 computes an OGT t' such that
$s \sqsupseteq_{in} t' \sqsupset_{in} t$. Furthermore, Partition runs in polynomial time using O(n^) SQ
in $n = |t|$.



Efficient Learning of Semi-structured Data from Queries 329 



Procedure Partition(t)
/* Simultaneously changing identical subtrees with the same new variables. */
For each distinct subtree s of t do begin
  $S_s := O_t(s)$ and $k := |S_s|$; let $z \in X \setminus var(t)$ be a new tree variable;
  For each $v \in S_s$ do:
    Create a new node u labeled with a copy of z;
    Attach u to the parent of v as the adjacent right sibling of v;
  $S_z := O_t(z)$ and $S := S_s \cup S_z$;
  While t changes do: /* executed whenever |S| > k */
    If there is some $v \in S \cap S_s$ such that SQ($t \setminus (t/v)$) = yes then
      $t := t \setminus (t/v)$; $S := S \setminus \{v\}$; /* removing the subtree rooted at v */
    Else if there is some $v \in S \cap S_z$ such that SQ($t \setminus \{v\}$) = yes then
      $t := t \setminus \{v\}$; $S := S \setminus \{v\}$; /* removing the variable node v */
  If ($s \in X$ and $S \neq S_z$) or ($s \notin X$ and $S \neq S_s$) then
    Return t; /* t is properly generalized */
end for;
Return t; /* t does not change */



Fig. 5. The partitioning algorithm using subset queries 



Theorem 6. There exists some algorithm that exactly learns any unrestricted
ordered gapped tree pattern in OGT with the into-match semantics in
polynomial time in n using exactly one EQ and O(n^) SQ, where n is the size of the
initial counterexample given by EQ. This also holds for either finite or infinite
alphabet Σ.

Proof. We modify the LEARN-INTO-LIN-OGT-SQ of Fig. 3 used for learning
μ-OGT with EQ and SQ. The following is the outline of the modified algorithm,
where $\gamma_1$ is the generalization operator for μ-OGT defined in Definition 9.

1 If (EQ(⊥) = yes) then return ⊥.
2 Let w be a counterexample returned by EQ. Set t = w.
3 While t changes during the loop, do:
  (a) If SQ(t') = yes for some $t' \in \gamma_1(t)$, then set t = t'.
  (b) Else Partition(t) returns an answer t', then set t = t'.
4 return t;

From Lemma 7, $\gamma_1$ is a proper and complete generalization operator. From
Lemma 11, Lemma El and Lemma 12, we can show a similar property as Lemma El.
On the other hand, the size complexity is also valid for OGT with repeated
variables. This and Lemma El show the termination after at most O(n) executions
of the while loop, and each execution requires O(n^) SQ. From Lemma 11 the
algorithm terminates only when $t \equiv_{in} t_*$. This completes the proof. □

7 Conclusion 

In this paper, we presented efficient algorithms for learning the class of μ-OGT
using equivalence and membership queries, and the class of μ-OGF and the class
[Figure: a pattern tree with leaves x, y, x, y, y; an instance tree with leaves
a, a, a, a, a; and four snapshots of the computation: (1) initial state,
(2) delete one a, (3) delete one z, (4) delete excess nodes.]

Fig. 6. An example computation of the partition algorithm



of unrestricted OGT with equivalence and subset queries under the into-match 
semantics. We also showed two hardness results which indicate that the above two
types of queries are necessary to efficiently learn μ-OGT and μ-OGF. Connection
among the learnabilities of OGT, OGF, pattern languages and first-order
logic m is a future problem. 
Abstract. An elementary formal system (EFS) is a logic program, such
as a Prolog program, for instance, that directly manipulates strings.
Arikawa and his co-workers proposed elementary formal systems as a
unifying framework for formal language learning.

In the present paper, we introduce advanced elementary formal systems
(AEFSs), i.e., elementary formal systems which allow for the use of a
certain kind of negation, which is nonmonotonic, in essence, and which
is conceptually close to negation as failure.

We study the expressiveness of this approach by comparing certain AEFS
definable language classes to the levels in the Chomsky hierarchy and to
the language classes that are definable by EFSs that meet the same
syntactical constraints.

Moreover, we investigate the learnability of the corresponding AEFS
definable language classes in two major learning paradigms, namely in
Gold's model of learning in the limit and Valiant's model of probably
approximately correct learning. In particular, we show which learnability
results achieved for EFSs extend to AEFSs and which do not.



1 Introduction and Motivation 

Elementary formal systems (EFSs) have been introduced by Smullyan [20] to 
develop his theory of recursive functions over strings. In [3] and in a series of 
subsequent publications like [5,24,4,6,19,25,14], for example, Arikawa and his co- 
workers proposed elementary formal systems as a unifying framework for formal 
language learning. 

EFSs are a kind of logic program, such as Prolog programs, for instance.
EFSs directly manipulate non-empty strings over some underlying alphabet and
can be used to describe formal languages. For instance, the EFS depicted in
Figure 1 describes the language that contains all non-empty strings of the form
$a^n b^n$. More formally speaking, if a ground atom p(w) can be derived from the
given rules, then the string w has to be of the form $a^n b^n$.

* This work has been partially supported by the German Ministry of Economics and 
Technology (BMWi) within the joint project LExIKON under grant 01 MD 949. 



N. Abe, R. Khardon, and T. Zeugmann (Eds.): ALT 2001, LNAI 2225, pp. 332—347, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 




(1) p(xy) ← q(x, y).
(2) q(a, b).
(3) q(ax, by) ← q(x, y).

Fig. 1. An example EFS



Arikawa and his co-workers (cf. [5,4], e.g.) used EFSs as a uniform frame- 
work to define acceptors for formal languages. In this context, they discussed the 
relation of certain EFS definable language classes to the standard levels in the 
classical Chomsky hierarchy. In addition, they have studied the learnability/non- 
learnability of EFS definable language classes in different learning paradigms, 
including Gold’s [9] model of learning in the limit as well as Valiant’s [23] model 
of probably approximately correct learning (cf. [5,4,19,25,14], e.g.). For instance, 
the results in [18,19] impressively show that EFSs provide an appropriate frame- 
work to prove that rich language classes are Gold-style learnable from only pos- 
itive examples. 

In the present paper, we follow this line of research. But in generalizing
ordinary EFSs, we introduce so-called advanced elementary formal systems (AEFSs,
for short). In contrast to EFSs, an AEFS may additionally contain rules of the
form $A \leftarrow \mathit{not}\ B_1$, where A and $B_1$ are atoms and not stands for a certain kind
of negation, which is nonmonotonic, in essence, and which is conceptually close
to negation as failure. Even this rather limited approach to using negation has
its benefits in that it may seriously simplify the definition of formal languages.
For instance, the following rules define the language of all square-free strings¹.
Formally speaking, a ground atom p(w) can be derived only in case that the
string w is square-free.



(1) p(x) ← not q(x).
(2) q(xx).
(3) q(xy) ← q(x).
(4) q(xy) ← q(y).



Fig. 2. An example AEFS 
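The intended meaning of the rules in Figure 2 can be mirrored by a direct procedural check. This is only an illustrative sketch, not a derivation procedure for AEFSs: `has_square` plays the role of the predicate q (some non-empty substring has the form vv, which is what rules (2)-(4) derive together), and `square_free` plays the role of p. Both function names are hypothetical.

```python
def has_square(w):
    """q(w): w contains a non-empty substring of the form vv."""
    n = len(w)
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            if w[start:start + half] == w[start + half:start + 2 * half]:
                return True
    return False

def square_free(w):
    """p(w) <- not q(w): succeeds exactly when q fails."""
    return not has_square(w)
```

For example, `square_free("abcab")` holds, whereas `square_free("abab")` fails because the whole string is the square of `ab`.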

The work reported in the present paper mainly draws its motivation from
ongoing research related to knowledge discovery and information extraction (IE)
in the World Wide Web. Documents prepared for the Internet in HTML, in XML
or in any other syntax have to be interpreted by browsers sitting anywhere in
the World Wide Web. For this purpose, the documents need to contain
syntactic expressions which control their interpretation, including their visual
appearance and their interactive behaviour. While the document's content is
embedded into those syntactic expressions, which are usually hidden from the user
and which are obviously apart from the user's interest, the user is typically
interested in the information itself. Accordingly, the user deals exclusively with
the desired contents, whereas a system for IE should deal with the syntax.

¹ As usual, a string w is square-free if it does not contain a non-empty substring of
the form vv.

In a characteristic scenario of system-supported IE, the user is taking a source 
document and is highlighting representative pieces of information that are of 
interest. Now, it is left to the system to understand how the target information 
is wrapped into syntactic expressions and to learn a procedure (henceforth called 
wrapper) that allows for an extraction of this information (cf. [12,21,8], e.g.). 

AEFSs seem to provide an appropriate framework to describe extraction pro- 
cedures that naturally comprises the approaches proposed in the IE community 
(cf. [12,22], e.g.). 

For illustration, consider the following table and its source, which contains
details about the first half-dozen workshops on Algorithmic Learning
Theory (ALT). The aim of the IE task is to extract all pairs (y, c) that refer to
the year y and the corresponding conference site c of a workshop in the ALT
series that has proceedings co-edited by Arikawa. So, the pairs (1990, Tokyo)
and (1994, Reinhardsbrunn) may serve as illustrating examples.



Year   Editors                                Publisher     Conference Site
1990   Arikawa, Goto, Oshuga, Yokomori        Ohmsha Ltd.   Tokyo
1991   Arikawa, Maruoka, Sato                 Ohmsha Ltd.   Tokyo
1992   Doshita, Furukawa, Jantke, Nishida     Springer      Tokyo
1993   Jantke, Kobayashi, Tomita, Yokomori    Springer      Tokyo
1994   Jantke, Arikawa                        Springer      Reinhardsbrunn
1995   Jantke, Shinohara, Zeugmann            Springer      Fukuoka



Fig. 3. Visual appearance of the sample document 



\begin{tabular}{|c|c|c|c|}
\hline
Year & Editors & Publisher & Conference Site \\\hline
1990 & Arikawa, Goto, Oshuga, Yokomori & Ohmsha Ltd. & Tokyo \\\hline
1991 & Arikawa, Maruoka, Sato & Ohmsha Ltd. & Tokyo \\\hline
1992 & Doshita, Furukawa, Jantke, Nishida & Springer & Tokyo \\\hline
1993 & Jantke, Kobayashi, Tomita, Yokomori & Springer & Tokyo \\\hline
1994 & Jantke, Arikawa & Springer & Reinhardsbrunn \\\hline
1995 & Jantke, Shinohara, Zeugmann & Springer & Fukuoka \\\hline
\end{tabular}

Fig. 4. LaTeX source of the sample document



Note that the line breaks in Figure 4 have additionally been inserted to 
improve readability. 

An AEFS that describes how the required information is wrapped into the
LaTeX source in Figure 4 looks as follows:





Extending Elementary Formal Systems 335 



(1) extract(y, c, x0\hline y&x1&x2&c\\x3) ← p(y), p(x1), p(x2), p(c), h(x1).

(2) p(x) ← not q(x).        (3) h(Arikawa).
(4) q(&).                   (5) h(xy) ← h(x).
(6) q(xy) ← q(x).           (7) h(xy) ← h(y).
(8) q(xy) ← q(y).



Fig. 5. Sample wrapper represented as hereditary AEFS 

The first rule can be interpreted as follows: A year y and the conference site c
can be extracted from a LaTeX source document d in case that (i) d matches the
pattern x0\hline y&x1&x2&c\\x3 and (ii) the instantiations of the variables y, x1,
x2, and c meet certain constraints. For example, the constraint h(x1) states that
the variable x1 can only be replaced by some string that contains the substring
Arikawa. Further constraints like p(y) explicitly state which text segments are
suited to be substituted for the variable y, for instance. In this particular case,
text segments that do not contain the substring & are allowed. If a document d
matches the pattern x0\hline y&x1&x2&c\\x3 and if all specified constraints are
fulfilled, then the instantiations of the variables y and c yield the information
required.
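The effect of the wrapper can be imitated with an ordinary regular expression, which may help to see what the pattern and its constraints select. This is a sketch under assumptions, not an implementation of AEFS derivation: the regular expression additionally forbids backslashes inside a cell (stronger than the constraint p, which only forbids &), surrounding whitespace is stripped, and the names `SOURCE`, `ROW` and `extract` are hypothetical.

```python
import re

# The source of Figure 4 as a raw string, so backslashes are kept literally.
SOURCE = r"""\begin{tabular}{|c|c|c|c|}
\hline
Year & Editors & Publisher & Conference Site \\\hline
1990 & Arikawa, Goto, Oshuga, Yokomori & Ohmsha Ltd. & Tokyo \\\hline
1991 & Arikawa, Maruoka, Sato & Ohmsha Ltd. & Tokyo \\\hline
1992 & Doshita, Furukawa, Jantke, Nishida & Springer & Tokyo \\\hline
1993 & Jantke, Kobayashi, Tomita, Yokomori & Springer & Tokyo \\\hline
1994 & Jantke, Arikawa & Springer & Reinhardsbrunn \\\hline
1995 & Jantke, Shinohara, Zeugmann & Springer & Fukuoka \\\hline
\end{tabular}"""

# \hline y & x1 & x2 & c \\  where no cell contains '&' (or a backslash).
ROW = re.compile(r"\\hline\s*([^&\\]+)&([^&\\]+)&([^&\\]+)&([^&\\]+)\\\\")

def extract(source):
    """Return all (year, site) pairs whose editor cell mentions Arikawa."""
    pairs = []
    for year, editors, _publisher, site in ROW.findall(source):
        if "Arikawa" in editors:  # plays the role of the constraint h(x1)
            pairs.append((year.strip(), site.strip()))
    return pairs

pairs = extract(SOURCE)
```

Excluding & from every cell is what keeps y and c within the same table row, which is the role of the predicate p in the wrapper.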

As the above example shows, the explicit use of logical negation seems to be 
quite useful, since it may help to describe wrappers in a natural way. In this 
particular case, the predicate p guarantees that the specified wrapper does not 
allow for the extraction of pairs (y,c) such that y and c belong to different rows 
in the table depicted in Figure 3. 

The focus of the present paper is twofold. On the one hand, we study the
expressiveness of the proposed extension of EFSs by comparing certain AEFS
definable language classes to the levels in the Chomsky hierarchy as well as to
the language classes that are definable by EFSs that meet the same syntactical
constraints. This may help to better understand the strength of the proposed
framework.

In the long term, we are interested in IE systems that automatically infer
wrappers from examples. With respect to the illustrating example above, we are
aiming at learning systems that are able to infer, for instance, the wrapper of
Figure 5 from the source document of Figure 4 together with the two samples
(1990, Tokyo) and (1994, Reinhardsbrunn). Therefore, on the other hand, we
investigate the learnability of the corresponding AEFS definable language classes
in Gold's [9] model of learning in the limit and Valiant's [23] model of probably
approximately correct learning. In this context, we systematically discuss the
question of which learnability results achieved for EFSs lift to AEFSs and which
do not.

2 Advanced Elementary Formal Systems 

AEFSs generalize Smullyan’s [20] elementary formal systems which he intro- 
duced to develop his theory of recursive functions over strings. 
2.1 Preliminaries

By Σ we denote any fixed finite alphabet. Let $\Sigma^+$ be the set of all non-empty
words over Σ. Moreover, we let $\Sigma^n$ denote the set of all words in $\Sigma^+$ having
length less than or equal to n, i.e., $\Sigma^n = \{w \mid w \in \Sigma^+, |w| \leq n\}$. Let $a \in \Sigma$.
Then, for all $n \geq 1$, $a^{n+1} = a a^n$, where, by convention, $a^1 = a$.

Any subset $L \subseteq \Sigma^+$ is called a language. By $\overline{L}$ we denote the complement
of L, i.e., $\overline{L} = \Sigma^+ \setminus L$. Furthermore, let $\mathcal{L}$ be a language class. Then, we let
$\mathcal{L}^n = \{L \cap \Sigma^n \mid L \in \mathcal{L}\}$.

By $\mathcal{L}_{reg}$, $\mathcal{L}_{cf}$, $\mathcal{L}_{cs}$, and $\mathcal{L}_{re}$ we denote the class of all regular, context free,
context sensitive, and recursively enumerable languages, respectively. These are
the standard levels in the well-known Chomsky hierarchy (cf. [10], e.g.).

The following lemmata provide standard knowledge about context free
languages (cf. [10], e.g.) that is helpful in proving Theorem 8 below.

Lemma 1 Let $L \subseteq \{a\}^+$. Then, $L \in \mathcal{L}_{cf}$ iff $L \in \mathcal{L}_{reg}$.

Lemma 2 Let $L \subseteq \Sigma^+$ be a context free language and let $\Sigma_0 \subseteq \Sigma$. Then,
$L' = L \cap \Sigma_0^+$ constitutes a context free language.

2.2 Elementary Formal Systems 

Next, we provide notions and notations that allow for a formal definition of
ordinary EFSs.

Assume three mutually disjoint sets: a finite set Σ of characters, a finite
set Π of predicate symbols, and an enumerable set X of variables. We call every
element in $(\Sigma \cup X)^+$ a pattern and every string in $\Sigma^+$ a ground pattern. For a
pattern π, we let v(π) be the set of variables in π.

Let $p \in \Pi$ be a predicate symbol of arity n and let $\pi_1, \ldots, \pi_n$ be patterns.
Let $A = p(\pi_1, \ldots, \pi_n)$. Then, A is said to be an atomic formula (an atom, for
short). A is ground, if all the patterns $\pi_i$ are ground. Moreover, v(A) denotes
the set of variables in A.

Let A and $B_1, \ldots, B_n$ be atoms. Then, $r = A \leftarrow B_1, \ldots, B_n$ is a rule, A is
the head of r, and all the $B_i$ form the body of r. If all atoms in r are ground,
then r is a ground rule. Moreover, if n = 0, then r is called a fact. Sometimes,
we write A instead of $A \leftarrow$.

Let σ be a non-erasing substitution, i.e., a mapping from X to $(\Sigma \cup X)^+$ such
that, for all but finitely many $x \in X$, σ(x) = x. For any pattern π, πσ is the
pattern which one obtains when applying σ to π. Let $C = p(\pi_1, \ldots, \pi_n)$ be an
atom and let $r = A \leftarrow B_1, \ldots, B_n$ be a rule. Then, we set $C\sigma = p(\pi_1\sigma, \ldots, \pi_n\sigma)$
and $r\sigma = A\sigma \leftarrow B_1\sigma, \ldots, B_n\sigma$. If rσ is ground, then it is said to be a ground
instance of r.

Definition 1 ([6]) Let Σ, Π, and X be fixed, and let Γ be a finite set of rules
over Σ, Π, and X. Then, S = (Σ, Π, Γ) is said to be an EFS.

EFSs can be considered as particular logic programs without negation. There 
are two major differences: (i) patterns play the role of terms and (ii) unification 
has to be realized modulo the equational theory 
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E = {o(a;, o(y, z)) = o(o(x, y),z)}, 

where o is interpreted as concatenation of patterns. 

As for logic programs (cf. [13], e.g.), the semantics of an ordinary EFS $S$,
denoted by $Sem_o(S)$, can be defined via the operator $T_S$ (see below). In the
corresponding definition, we use the following notations. For any EFS $S = (\Sigma, \Pi, \Gamma)$,
we let $B(S)$ denote the set of all well-formed ground atoms over $\Sigma$ and $\Pi$.
Moreover, we let $G(S)$ denote the set of all ground instances of rules in $\Gamma$.

Definition 2 Let $S$ be an EFS and let $I \subseteq B(S)$. Then, we let $T_S(I) = I \cup \{A \mid
A \leftarrow B_1, \ldots, B_n \in G(S)$ for some $B_1 \in I, \ldots, B_n \in I\}$.

Note that, by definition, the operator $T_S$ is embedding (i.e., $I \subseteq T_S(I)$ for
all $I \subseteq B(S)$) and monotonic (i.e., $I \subseteq I'$ implies $T_S(I) \subseteq T_S(I')$ for all
$I, I' \subseteq B(S)$).

As usual, we let $T_S^{n+1}(I) = T_S(T_S^n(I))$, where $T_S^0(I) = I$, by convention.

Definition 3 Let $S$ be an EFS. Then, we let $Sem_o(S) = \bigcup_{n \in \mathbb{N}} T_S^n(\emptyset)$.

In general, $Sem_o(S)$ is semi-decidable, but not decidable. However, as we
will see below, $Sem_o(S)$ turns out to be decidable in case that $S$ meets several
natural syntactical constraints.
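
To make the fixed-point construction concrete, the following Python sketch iterates $T_S$ starting from the empty set. It is a minimal illustration under stated assumptions: the example EFS (rules $p(a)$ and $p(ax) \leftarrow p(x)$ over the characters a and b), the truncation of $G(S)$ to words of bounded length, and all function names are chosen here for illustration. The truncation is what makes the iteration terminate; the untruncated semantics is only semi-decidable, as noted above.

```python
from itertools import product


def words(alphabet, max_len):
    """All non-empty words over `alphabet` of length at most `max_len`."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def ground_instances(alphabet, max_len):
    """Ground instances of p(a) and p(ax) <- p(x) with heads of length <= max_len.

    Each ground rule is a pair (head, body); an atom is a pair (predicate, word).
    """
    rules = [(("p", "a"), ())]  # the fact p(a)
    for x in words(alphabet, max_len - 1):
        rules.append((("p", "a" + x), (("p", x),)))
    return rules


def t_operator(rules, interpretation):
    """One application of T_S: keep I and add every head whose body lies in I."""
    derived = {head for head, body in rules if all(b in interpretation for b in body)}
    return interpretation | derived


def semantics(rules):
    """Least fixed point of T_S over the given (finite) set of ground rules."""
    current = set()
    while True:
        following = t_operator(rules, current)
        if following == current:
            return current
        current = following


sem = semantics(ground_instances("ab", 4))
```

On this example the iteration adds one atom per round: $p(a)$, then $p(aa)$, and so on, which shows both the embedding property (nothing is ever removed) and the role of the union over all $n$.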

Finally, by $\mathcal{EFS}$ we denote the collection of all EFSs.



2.3 Beyond Elementary Formal Systems 

Informally speaking, an AEFS is an EFS that may additionally contain rules of
the form $A \leftarrow$ not $B_1$, where $A$ and $B_1$ are atoms and not stands for a certain
kind of negation, which is nonmonotonic, in essence, and which is conceptually
close to negation as failure. The underlying meaning is as follows. If, for instance,
$A = p(x_1, \ldots, x_n)$ and $B_1 = q(x_1, \ldots, x_n)$, then the predicate $p$ succeeds iff the
predicate $q$ fails.

However, taking the conceptual difficulties into consideration that occur 
when defining the semantics of logic programs with negation as failure (cf. [13], 
e.g.), AEFSs are constrained to meet several additional syntactic requirements 
(cf. Definition 4). The requirements posed guarantee that, similarly to stratified 
logic programs (cf. [13], e.g.), the semantics of AEFSs can easily be described. 
Moreover, as a side-effect, it is guaranteed that AEFSs inherit some of the con- 
venient properties of EFSs. 

Before formally defining what AEFSs look like, we need some more notations.
Let $\Gamma$ be a set of rules (including rules of the form $A \leftarrow$ not $B_1$). Then, $hp(\Gamma)$
denotes the set of predicate symbols that appear in the head of any rule in $\Gamma$.

Definition 4 AEFSs and their semantics are inductively defined as follows.

(1) An EFS $S'$ is also an AEFS and its semantics is $Sem(S') = Sem_o(S')$.

(2) If $S_1 = (\Sigma, \Pi_1, \Gamma_1)$ and $S_2 = (\Sigma, \Pi_2, \Gamma_2)$ are AEFSs such that $\Pi_1 \cap \Pi_2 = \emptyset$,
then $S = (\Sigma, \Pi_1 \cup \Pi_2, \Gamma_1 \cup \Gamma_2)$ is an AEFS and its semantics is $Sem(S) =
Sem(S_1) \cup Sem(S_2)$.

(3) If $S_1 = (\Sigma, \Pi_1, \Gamma_1)$ is an AEFS and $p \notin \Pi_1$ and $q \in \Pi_1$ are $n$-ary predicate
symbols, then $S = (\Sigma, \Pi_1 \cup \{p\}, \Gamma_1 \cup \{p(x_1, \ldots, x_n) \leftarrow$ not $q(x_1, \ldots, x_n)\})$
is an AEFS and its semantics is $Sem(S) = Sem(S_1) \cup \{p(s_1, \ldots, s_n) \mid
p(s_1, \ldots, s_n) \in B(S), q(s_1, \ldots, s_n) \notin Sem(S_1)\}$.

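(4) If $S_1 = (\Sigma, \Pi_1, \Gamma_1)$ is an AEFS and $S' = (\Sigma, \Pi', \Gamma')$ is an EFS such that
$hp(\Gamma') \cap \Pi_1 = \emptyset$, then $S = (\Sigma, \Pi' \cup \Pi_1, \Gamma' \cup \Gamma_1)$ is an AEFS and its semantics
is $Sem(S) = \bigcup_{n \in \mathbb{N}} T_{S'}^n(Sem(S_1))$.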

Finally, by $\mathcal{AEFS}$ we denote the collection of all AEFSs.

According to Definition 4, the same AEFS may be constructed either via (2)
or (4). Since $T_S$ is both embedding and monotonic, the semantics is the same in
both cases.
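
Clause (3) of Definition 4 amounts to taking a complement relative to the ground atoms of the new predicate. The Python sketch below illustrates this on a bounded universe. It rests on assumptions made here for illustration: words are truncated to a maximal length so that $B(S)$ becomes finite, atoms are encoded as pairs of a predicate name and a tuple of words, and the sample system $S_1$ (with $q$ holding exactly for the words that start with b, as the rules $q(b)$ and $q(bx)$ would define) is hypothetical.

```python
from itertools import product


def ground_atoms(pred, alphabet, arity, max_len):
    """Ground atoms of `pred` whose arguments are words of length <= max_len."""
    ws = ["".join(t) for n in range(1, max_len + 1)
          for t in product(alphabet, repeat=n)]
    return {(pred, args) for args in product(ws, repeat=arity)}


def add_negation(sem_s1, p, q, alphabet, arity, max_len):
    """Semantics after adding p(x_1..x_n) <- not q(x_1..x_n) to S_1.

    p(s) is added exactly for those tuples s with q(s) not in Sem(S_1).
    """
    new_atoms = {
        (p, args)
        for (_, args) in ground_atoms(p, alphabet, arity, max_len)
        if (q, args) not in sem_s1
    }
    return sem_s1 | new_atoms


# Hypothetical S_1: q holds for every word over {a, b} that starts with b.
sem_s1 = {atom for atom in ground_atoms("q", "ab", 1, 3)
          if atom[1][0].startswith("b")}
sem_s = add_negation(sem_s1, "p", "q", "ab", 1, 3)
p_words = sorted(args[0] for (pred, args) in sem_s if pred == "p")
```

Because $Sem(S_1)$ is fully determined before $p$ is evaluated, no circular dependency through negation can arise; this is the stratification that the syntactic requirements enforce.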

2.4 Using AEFS for Defining Formal Languages 

In the following, we show how AEFSs can be used to describe formal languages 
and relate the resulting language classes to the language classes of the classical 
Chomsky hierarchy. 

Definition 5 Let $S = (\Sigma, \Pi, \Gamma)$ be an AEFS and let $p \in \Pi$ be a unary predicate
symbol. Then, we let $L(S, p) = \{s \mid p(s) \in Sem(S)\}$.

Furthermore, a language $L \subseteq \Sigma^+$ is said to be AEFS definable iff there are
a superset $\Sigma_0$ of $\Sigma$, an AEFS $S = (\Sigma_0, \Pi, \Gamma)$, and a unary predicate symbol
$p \in \Pi$ with $L = L(S, p)$.

Intuitively speaking, $L(S, p)$ is the language which the AEFS $S$ defines via
the unary predicate symbol $p$.

Definition 6 Let $\mathcal{M} \subseteq \mathcal{AEFS}$ and let $k \in \mathbb{N}$. Then, $\mathcal{L}(\mathcal{M})$ is the set of all
languages that are definable by AEFSs in $\mathcal{M}$. Moreover, $\mathcal{L}(\mathcal{M}(k))$ is the set of
all languages that are definable by AEFSs in $\mathcal{M}$ that have at most $k$ rules.

For example, $\mathcal{L}(\mathcal{AEFS}(2))$ is the class of all languages that are definable by
unconstrained AEFSs that consist of at most 2 rules.

Our first result puts the expressive power of AEFSs into the right perspective. 

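Theorem 1 $\mathcal{L}_{re} \subset \mathcal{L}(\mathcal{AEFS})$.

Proof: Since, by definition, $\mathcal{L}(\mathcal{EFS}) \subseteq \mathcal{L}(\mathcal{AEFS})$, and $\mathcal{L}_{re} \subseteq \mathcal{L}(\mathcal{EFS})$ (cf. [6]),
we get $\mathcal{L}_{re} \subseteq \mathcal{L}(\mathcal{AEFS})$. Since there are languages $L \in \mathcal{L}_{re}$ that have a
complement which is not recursively enumerable (cf. [17]), $\mathcal{L}(\mathcal{AEFS}) \setminus \mathcal{L}_{re} \neq \emptyset$ is an
immediate consequence of Theorem 2 below. □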

Moreover, the following closure properties can be shown.

Theorem 2 $\mathcal{L}(\mathcal{AEFS})$ is closed under the operations union, intersection, and
complement.

To elaborate a more accurate picture, similarly to [6], we next introduce 
several constraints on the structure of the rules an AEFS may contain. 

Let $r$ be a rule of form $A \leftarrow B_1, \ldots, B_n$. Then, $r$ is said to be
variable-bounded iff, for all $i \leq n$, $v(B_i) \subseteq v(A)$. Moreover, $r$ is said to be length-bounded
iff, for all substitutions $\sigma$, $|A\sigma| \geq |B_1\sigma| + \cdots + |B_n\sigma|$. Clearly, if $r$ is
length-bounded, then $r$ is also variable-bounded. Note that, in general, the opposite
does not hold.
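
Both conditions can be tested syntactically, as the following Python sketch illustrates. The quantification over all substitutions is replaced by an equivalent finite check: since $|\pi\sigma|$ grows linearly in the lengths $|\sigma(x)| \geq 1$, the inequality holds for every $\sigma$ exactly if the head is at least as long as the body and every variable occurs in the head at least as often as in the whole body. The encoding is an assumption made here: patterns are strings in which upper-case letters are variables and lower-case letters are characters, and the length of an atom is taken to be the total length of its argument patterns.

```python
from collections import Counter


def variables(atom):
    """Set of variables (upper-case letters) occurring in an atom (pred, patterns)."""
    return {c for pattern in atom[1] for c in pattern if c.isupper()}


def size(atom):
    """Length of an atom: the summed lengths of its argument patterns."""
    return sum(len(pattern) for pattern in atom[1])


def occurrences(atom):
    """Number of occurrences of each variable in an atom."""
    return Counter(c for pattern in atom[1] for c in pattern if c.isupper())


def is_variable_bounded(head, body):
    return all(variables(b) <= variables(head) for b in body)


def is_length_bounded(head, body):
    # Check at the shortest substitution: every variable replaced by one symbol.
    if size(head) < sum(size(b) for b in body):
        return False
    # Check the growth rate: no variable may be more frequent in the body.
    head_occ = occurrences(head)
    body_occ = Counter()
    for b in body:
        body_occ += occurrences(b)
    return all(head_occ[x] >= n for x, n in body_occ.items())


# Hypothetical rules: p(aX) <- p(X) and p(X) <- q(XX).
rule1 = (("p", ["aX"]), [("p", ["X"])])
rule2 = (("p", ["X"]), [("q", ["XX"])])
```

The second rule is variable-bounded but not length-bounded, which is an instance of the remark that the converse implication fails.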

Moreover, if $r$ is a rule of form $p(\pi) \leftarrow q_1(x_1), \ldots, q_n(x_n)$, where $x_1, \ldots, x_n$
are mutually distinct variables and $\pi$ is a regular* pattern which contains exactly
the variables $x_1, \ldots, x_n$, then $r$ is said to be regular.

In addition, every rule of form $p(x_1, \ldots, x_n) \leftarrow$ not $q(x_1, \ldots, x_n)$ is both
variable-bounded and length-bounded. Moreover, every rule of form $p(x) \leftarrow$
not $q(x)$ is regular.

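Definition 7 Let $S = (\Sigma, \Pi, \Gamma)$ be an AEFS. Then, $S$ is said to be

(1) variable-bounded iff all $r \in \Gamma$ are variable-bounded,

(2) length-bounded iff all $r \in \Gamma$ are length-bounded, and

(3) regular iff all $r \in \Gamma$ are regular.

By $\mathit{vb}\text{-}\mathcal{AEFS}$ ($\mathit{vb}\text{-}\mathcal{EFS}$), $\mathit{lb}\text{-}\mathcal{AEFS}$ ($\mathit{lb}\text{-}\mathcal{EFS}$), and $\mathit{reg}\text{-}\mathcal{AEFS}$ ($\mathit{reg}\text{-}\mathcal{EFS}$) we
denote the collection of all AEFSs (EFSs) that are variable-bounded,
length-bounded, and regular, respectively.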

The following three theorems illuminate the expressive power of ordinary 
EFSs. 

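Theorem 3 ([6])

(1) $\mathcal{L}(\mathit{vb}\text{-}\mathcal{EFS}) \subseteq \mathcal{L}_{re}$.

(2) For any $L \in \mathcal{L}_{re}$, there is an $L' \in \mathcal{L}(\mathit{vb}\text{-}\mathcal{EFS})$ such that $L = L' \cap \Sigma^+$.

If $\Sigma$ contains at least two symbols, assertion (2) rewrites to $\mathcal{L}_{re} \subseteq \mathcal{L}(\mathit{vb}\text{-}\mathcal{EFS})$
(cf. [6]).

Theorem 4 ([6])

(1) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}) \subseteq \mathcal{L}_{cs}$.

(2) For any $L \in \mathcal{L}_{cs}$, there is an $L' \in \mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS})$ such that $L = L' \cap \Sigma^+$.

Theorem 5 ([6]) $\mathcal{L}(\mathit{reg}\text{-}\mathcal{EFS}) = \mathcal{L}_{cf}$.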

Concerning AEFSs the situation changes slightly. This is mainly caused by 
the fact that variable-bounded, length-bounded, and regular AEFSs are closed 
under intersection. 

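Theorem 6 $\mathcal{L}(\mathit{vb}\text{-}\mathcal{AEFS})$, $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS})$, and $\mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS})$ are closed under
the operations union, intersection, and complement.

For AEFSs, Theorems 3 and 4 rewrite as follows.

Theorem 7

(1) $\mathcal{L}_{re} \subset \mathcal{L}(\mathit{vb}\text{-}\mathcal{AEFS})$.

(2) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}) = \mathcal{L}_{cs}$.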

Proof: First, we show (1). Applying Theorem 6, one sees that assertion (2) of
Theorem 3 rewrites to $\mathcal{L}_{re} \subseteq \mathcal{L}(\mathit{vb}\text{-}\mathcal{AEFS})$. Next, $\mathcal{L}(\mathit{vb}\text{-}\mathcal{AEFS}) \setminus \mathcal{L}_{re} \neq \emptyset$ can be
shown by applying the same arguments as in the demonstration of Theorem 1.

Second, we verify (2). Again, applying Theorem 6, one directly sees that
assertion (2) of Theorem 4 rewrites to $\mathcal{L}_{cs} \subseteq \mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS})$. Moreover, by
definition, for any $L \in \mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS})$, there are languages $L_0, \ldots, L_n \in \mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS})$
such that $L$ can be defined by applying the operations union and intersection
to these languages. Since $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}) \subseteq \mathcal{L}_{cs}$ and since $\mathcal{L}_{cs}$ is closed with respect
to the operations union and intersection (cf. [10], e.g.), we may conclude that
$\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}) \subseteq \mathcal{L}_{cs}$. □

* A pattern $\pi$ is said to be regular iff every variable occurs at most once in $\pi$.

In our opinion, assertion (2) of Theorem 7 witnesses the naturalness of our
approach to extend EFSs to AEFSs. In contrast to assertion (2) of Theorem 4,
there is no need to use auxiliary characters in the terminal alphabet.

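Theorem 8 $\mathcal{L}_{cf} \subset \mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS}) \subset \mathcal{L}_{cs}$.

Proof: First, $\mathcal{L}_{cf} \subseteq \mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS}) \subseteq \mathcal{L}_{cs}$ follows immediately from Theorems 5
and 7.

Second, $\mathcal{L}_{cf} \subset \mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS})$ follows from the fact that $\mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS})$ is
closed under intersection (cf. Theorem 6), while $\mathcal{L}_{cf}$ is not (cf. [10], e.g.).

Third, we show that $\mathcal{L}_{cs} \setminus \mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS}) \neq \emptyset$. Let $L \subseteq \{a\}^+$ with $L \in \mathcal{L}_{cs} \setminus \mathcal{L}_{cf}$
(cf. [10], for some illustrating examples). We claim that $L \notin \mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS})$.
Suppose the contrary, i.e., $L \in \mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS})$. By definition, there are languages
$L_0, \ldots, L_n \in \mathcal{L}(\mathit{reg}\text{-}\mathcal{EFS})$ such that $L$ can be defined by applying the operations
union and intersection to these languages. Let $i \leq n$. By Theorem 5, $L_i \in \mathcal{L}_{cf}$.
Moreover, let $L_i' = L_i \cap \{a\}^+$. By Lemma 2, $L_i' \in \mathcal{L}_{cf}$, and thus, by Lemma 1,
$L_i' \in \mathcal{L}_{reg}$. Next, one easily sees that $L$ can also be defined by applying the
operations union and intersection to the languages $L_0', \ldots, L_n'$. Finally, since $\mathcal{L}_{reg}$
is closed with respect to the operations union and intersection, we may therefore
conclude that $L \in \mathcal{L}_{reg}$ which in turn yields $L \in \mathcal{L}_{cf}$, a contradiction. □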

3 Learning of AEFSs 

3.1 Notions and Notations 

First, we briefly review the necessary basic concepts concerning Gold’s [9] model 
of learning in the limit. We refer the reader to the survey papers [2] and [26] as 
well as to the textbook [11] which contain all missing details. 

There are several ways to present information about formal languages to be
learned. The basic approaches are defined via the key concepts text and informant,
respectively. Let $L$ be the target language. A text for $L$ is just any sequence of
words labelled ‘+’ that exhausts $L$. An informant for $L$ is any sequence of words
labelled alternatively either by ‘+’ or ‘−’ such that all the words labelled by ‘+’
form a text for $L$, while the remaining words labelled by ‘−’ constitute a text
for the complement $\overline{L}$. Sometimes, labelled words are called examples.

As in [9], we define an inductive inference machine (abbr. IIM) to be an
algorithmic device working as follows: The IIM takes as its input larger and larger
initial segments of a text (an informant). After processing an initial segment $\sigma$,
the IIM outputs a hypothesis $M(\sigma)$, i.e., a number encoding a certain computer
program. More formally, an IIM maps finite sequences of elements from $\Sigma^+ \times
\{+, -\}$ into numbers in $\mathbb{N}$.

The numbers output by an IIM are interpreted with respect to a suitably
chosen hypothesis space $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$. When an IIM outputs some number $j$,
we interpret it to mean that the machine is hypothesizing $h_j$.

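Now, let $\mathcal{C}$ be a language class, let $L$ be a language, and let $\mathcal{H} = (h_j)_{j \in \mathbb{N}}$
be a hypothesis space. An IIM $M$ $\mathit{LimTxt}_{\mathcal{H}}$ ($\mathit{LimInf}_{\mathcal{H}}$)-learns $L$ iff, for every
text $t$ for $L$ (for every informant $i$ for $L$), there exists a $j \in \mathbb{N}$ such that $h_j = L$,
and moreover $M$ almost always outputs the hypothesis $j$ when fed the text $t$
(the informant $i$). Furthermore, an IIM $M$ $\mathit{LimTxt}_{\mathcal{H}}$ ($\mathit{LimInf}_{\mathcal{H}}$)-learns $\mathcal{C}$ iff,
for every $L \in \mathcal{C}$, $M$ $\mathit{LimTxt}_{\mathcal{H}}$ ($\mathit{LimInf}_{\mathcal{H}}$)-learns $L$. In addition, we write $\mathcal{C} \in
\mathit{LimTxt}$ ($\mathcal{C} \in \mathit{LimInf}$) provided there are a hypothesis space $\mathcal{H}$ and an IIM $M$
that $\mathit{LimTxt}_{\mathcal{H}}$ ($\mathit{LimInf}_{\mathcal{H}}$)-learns $\mathcal{C}$.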
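
As a small illustration of learning in the limit from text, the following Python sketch runs a simple IIM on a finite prefix of a text. Everything specific here is an assumption chosen for illustration: the alphabet is $\{a\}$, the hypothesis space is taken to be $h_j = \{a^n \mid 1 \leq n \leq j\}$, and the learner outputs the length of the longest word seen so far.

```python
def learner(segment):
    """Hypothesis after an initial segment of a text: the largest length seen."""
    return max((len(word) for word in segment), default=0)


def run(text_prefix):
    """Sequence of hypotheses output on longer and longer initial segments."""
    return [learner(text_prefix[:i]) for i in range(1, len(text_prefix) + 1)]


# Prefix of a text for the (hypothetical) target h_3 = {a, aa, aaa}.
text_for_h3 = ["a", "aaa", "aa", "a", "aaa", "aa", "a"]
hypotheses = run(text_for_h3)
```

Once the longest word of a finite target has appeared, the hypothesis never changes again, so the learner converges on every text for every $h_j$. On a text for the infinite language $\{a\}^+$, in contrast, this particular learner changes its hypothesis infinitely often.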

Next, we focus our attention on Valiant’s [23] model of probably approxi- 
mately correct learning (PAC model, for short; see also the textbook [16] for 
further details). In contrast to Gold’s [9] model, the focus is now on learning al- 
gorithms that, based on randomly chosen positive and negative examples, find, 
fast and with high probability, a sufficiently good approximation of the target 
language. 

To give a precise definition of the PAC model, we need the following notions
and notations. We use a finite alphabet $A$ for representing languages. A
representation for a language class $\mathcal{C}$ is a function $R : \mathcal{C} \to \wp(A^+)$ such that, for
all distinct languages $L, L' \in \mathcal{C}$, $R(L) \neq \emptyset$ and $R(L) \cap R(L') = \emptyset$. Let $L \in \mathcal{C}$.
Then, $R(L)$ is the set of representations for $L$ and $\ell_{min}(L, R)$ is the length of the
shortest string in $R(L)$. Moreover, let $T$ be a set of examples. Then, $\ell_{min}(T, R)$
is the length of a shortest representation in $R(L)$ that is consistent* with $T$.

Definition 8 ([23]) A language class $\mathcal{C}$ is polynomial-time PAC learnable in a
representation $R$ iff there exists a learning algorithm $\mathcal{A}$ such that

(1) $\mathcal{A}$ takes a sequence of examples as input and runs in polynomial time with
respect to the length of the input;

(2) there exists a polynomial $q(\cdot, \cdot, \cdot, \cdot)$ such that, for any $L \in \mathcal{C}$, any $n \in \mathbb{N}$, any
$s \geq 1$, any reals $\varepsilon, \delta$ with $0 < \varepsilon, \delta < 1$, and any probability distribution $Pr$
on $\Sigma^n$, if $\mathcal{A}$ takes $q(1/\varepsilon, 1/\delta, n, s)$ examples, which are generated randomly
according to $Pr$, then $\mathcal{A}$ outputs, with probability at least $1 - \delta$, a hypothesis
$h \in R$ with $Pr(w \in ((L \setminus h) \cup (h \setminus L))) < \varepsilon$, when $\ell_{min}(L, R) \leq s$ is satisfied.

We complete this section by providing some more notions and notations that 
are of relevance when proving some of the learnability/non-learnability results 
presented below. 

Definition 9 A pair $(S, p)$ of an AEFS $S = (\Sigma, \Pi, \Gamma)$ and a unary predicate
symbol $p \in \Pi$ is said to be reduced with respect to a set $T$ of examples iff $L(S, p)$
is consistent with $T$ and, for any $S' = (\Sigma, \Pi, \Gamma')$ with $\Gamma' \subset \Gamma$, $L(S', p)$ is not
consistent with $T$.

* As usual, a language $L$ is said to be consistent with $T$ iff, for all $(x, +) \in T$, $x \in L$
and, for all $(x, -) \in T$, $x \notin L$.

The following notion adopts one of the key concepts in [18], where it has been
shown that, for classes of elementary formal systems, bounded finite thickness
implies that the corresponding language class is learnable in the limit from only
positive examples.

Definition 10 ([18]) Let $\mathcal{M} \subseteq \mathcal{EFS}$. $\mathcal{M}$ is said to have bounded finite thickness
iff, for all $w \in \Sigma^+$, there are at most finitely many EFSs $S \in \mathcal{M}$ such that (i)
$S$ is reduced with respect to $T = \{(w, +)\}$ and (ii) the language defined by $S$ is
consistent with $T$.

Finally, we define the notion polynomial dimension which is one of the key
notions when studying the learnability of formal languages in the PAC model.

Definition 11 ([15]) Let $\mathcal{C}$ be a language class. $\mathcal{C}$ has polynomial dimension
iff there is a polynomial $d(\cdot)$ such that, for all $n \in \mathbb{N}$, $\log_2 |\mathcal{C}^n| \leq d(n)$.
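
The reason why polynomial dimension matters can be seen from the textbook sample-size bound for learners that output a hypothesis consistent with the sample: for a finite hypothesis class $H$, $\frac{1}{\varepsilon}(\ln |H| + \ln \frac{1}{\delta})$ examples suffice. This bound is a standard fact from the PAC literature and is not stated in the text above; the sketch below merely evaluates it, with the function name and parameters chosen here for illustration. If $\log_2 |\mathcal{C}^n| \leq d(n)$, then $\ln |H|$ can be replaced by $d(n) \ln 2$, so the number of examples is polynomial in $n$, $1/\varepsilon$, and $1/\delta$.

```python
import math


def sample_size(epsilon, delta, log2_class_size):
    """Examples sufficient for a consistent learner over a finite class.

    epsilon:          accuracy parameter, 0 < epsilon < 1
    delta:            confidence parameter, 0 < delta < 1
    log2_class_size:  an upper bound d(n) on log2 of the number of hypotheses
    """
    ln_class_size = log2_class_size * math.log(2)
    return math.ceil((ln_class_size + math.log(1 / delta)) / epsilon)


# Hypothetical numbers: d(n) = 10, accuracy 0.1, confidence 0.05.
needed = sample_size(0.1, 0.05, 10)
```

The bound addresses only the number of examples; polynomial running time additionally requires an efficient procedure that finds a consistent hypothesis.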



3.2 Gold-Style Learning 

The following theorem summarizes the known learnability results for EFSs.
Recall that, by definition, $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}(k))$ is the collection of all languages that are
definable by length-bounded EFSs that consist of at most $k$ rules.

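Theorem 9 ([9,19])

(1) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}) \in \mathit{LimInf}$.

(2) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}) \notin \mathit{LimTxt}$.

(3) For all $k \in \mathbb{N}$, $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}(k)) \in \mathit{LimTxt}$.

Having in mind that $\mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}) = \mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS})$, we may directly conclude:

Corollary 1.

(1) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}) \in \mathit{LimInf}$.

(2) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}) \notin \mathit{LimTxt}$.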

The next theorem points to a major difference concerning the learnability of 
EFSs and AEFSs, respectively. 

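Theorem 10

(1) $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}(1)) \in \mathit{LimTxt}$.

(2) For all $k \geq 2$, $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}(k)) \notin \mathit{LimTxt}$.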

Proof: By definition, $\mathcal{L}(\mathit{lb}\text{-}\mathcal{AEFS}(1)) = \mathcal{L}(\mathit{lb}\text{-}\mathcal{EFS}(1))$, and thus (1) follows
from Theorem 9.

Next, let $k = 2$. Let $\Sigma = \{a\}$ and consider the family $\mathcal{C} = (L_i)_{i \in \mathbb{N}}$ such that
$L_0 = \{a^n \mid n \in \mathbb{N}\}$ and $L_{i+1} = \{a^n \mid n \leq i + 1\}$. $\mathcal{C}$ can be defined via the family
of regular AEFSs $(S_i = (\Sigma, \Pi, \Gamma_i))_{i \in \mathbb{N}}$ with $\Pi = \{p, q\}$, $\Gamma_0 = \{p(a), p(ax) \leftarrow
p(x)\}$, and $\Gamma_i = \{q(a^i x), p(x) \leftarrow$ not $q(x)\}$ for all $i \geq 1$. Obviously, for every
$i \in \mathbb{N}$, $L(S_i, p) = L_i$. On the other hand, it is well-known that $\mathcal{C} \notin \mathit{LimTxt}$
(cf. [26], e.g.), and therefore we are done. □

3.3 Probably Approximately Correct Learning

In [4,14], the polynomial-time PAC learnability of several language classes that
are definable by EFSs has been studied. It has been shown that even quite
simple classes are not polynomial-time PAC learnable - for instance, the class
of all regular pattern languages*. However, if one bounds the number of
variables that may occur in the defining patterns, regular pattern languages become
polynomial-time PAC learnable. Moreover, by putting further constraints on the
rules that can be used to define EFSs, positive results for even larger EFS
definable language classes have been achieved (cf. [4,14]). The relevant technicalities
are as follows.

A rule of form $p(\pi_1, \ldots, \pi_n) \leftarrow p_1(\tau_1, \ldots, \tau_{t_1}), \ldots, p_k(\tau_{t_{k-1}+1}, \ldots, \tau_{t_k})$ is
said to be hereditary iff, for every $j = 1, \ldots, t_k$, the pattern $\tau_j$ is a subword of
some pattern $\pi_i$. Moreover, any rule of form $p(x_1, \ldots, x_n) \leftarrow$ not $q(x_1, \ldots, x_n)$
is a hereditary one, since it obviously meets the syntactical constraints stated
above. Note that, by definition, every hereditary rule is variable-bounded.
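
The hereditary condition is a purely syntactic test, which the following Python sketch illustrates. The encoding is an assumption made here: patterns are strings in which upper-case letters are variables, an atom is a pair of a predicate name and a list of patterns, and a subword is a contiguous substring.

```python
def is_hereditary(head, body):
    """Check that every body pattern is a subword of some head pattern."""
    _, head_patterns = head
    return all(
        any(tau in pi for pi in head_patterns)
        for _, body_patterns in body
        for tau in body_patterns
    )


# Hypothetical rules.
rule_a = (("p", ["aXb"]), [("q", ["X"]), ("r", ["Xb"])])   # hereditary
rule_b = (("p", ["XY"]), [("q", ["YX"])])                   # not hereditary
rule_c = (("p", ["X", "Y"]), [("q", ["X", "Y"])])           # shape of a negation rule
```

The third example has the shape of a rule $p(x_1, x_2) \leftarrow$ not $q(x_1, x_2)$: each body pattern coincides with a head pattern, so the test succeeds trivially. Since every variable of a body pattern then also occurs in the head, variable-boundedness follows.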

Definition 12 Let $S = (\Sigma, \Pi, \Gamma)$ be an AEFS. Then, $S$ is said to be hereditary
iff all $r \in \Gamma$ are hereditary. By $\mathit{h}\text{-}\mathcal{AEFS}$ ($\mathit{h}\text{-}\mathcal{EFS}$) we denote the collection of all
hereditary AEFSs (EFSs).

In contrast to the general case (cf. Definition 5), hereditary AEFSs have the
following nice feature. Let $L \subseteq \Sigma^+$ with $L \in \mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS})$. Then, there is a
hereditary AEFS for $L$ consisting only of rules that use exclusively characters
from $\Sigma$.

Definition 13 Let $m, k, t, r \in \mathbb{N}$. By $\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r)$ ($\mathit{h}\text{-}\mathcal{EFS}(m, k, t, r)$) we
denote the collection of all hereditary AEFSs (EFSs) $S$ that satisfy (1) to (4),
where

(1) $S$ contains at most $m$ rules,

(2) the number of variable occurrences in the head of every rule in $S$ is at most $k$,

(3) the number of atoms in the body of every rule in $S$ is at most $t$,

(4) the arity of each predicate symbol in $S$ is at most $r$.

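Taking into consideration that $\mathcal{L}(\mathit{reg}\text{-}\mathcal{EFS}) = \mathcal{L}_{cf}$ (cf. Theorem 5), one can
easily show that $\mathcal{L}(\mathit{reg}\text{-}\mathcal{EFS}) \subseteq \bigcup_{m \in \mathbb{N}} \mathcal{L}(\mathit{h}\text{-}\mathcal{EFS}(m, 2, 1, 2))$ (cf. [4]). Similarly, it
can easily be verified that $\mathcal{L}(\mathit{reg}\text{-}\mathcal{AEFS}) \subseteq \bigcup_{m \in \mathbb{N}} \mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, 2, 1, 2))$. Hence,
hereditary EFSs and AEFSs, respectively, are much more expressive than it might seem.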

For hereditary EFSs, the following learnability result is known. 

Theorem 11 ([14]) Let $m, k, t, r \in \mathbb{N}$. Then, the class $\mathcal{L}(\mathit{h}\text{-}\mathcal{EFS}(m, k, t, r))$ is
polynomial-time PAC learnable.

As the results in [14] impressively show, it is inevitable to a priori bound all 
the defining parameters. In other words, none of the resulting language classes is 
polynomial-time PAC learnable, if at least one of the parameters involved may 
arbitrarily grow. 

* That is, the class of all languages that are definable by an EFS that consists of
exactly one rule of form $p(\pi)$, where $\pi$ is a regular pattern.

Next, we turn our attention to the learnability of language classes that
are definable by hereditary AEFSs.

Our first result demonstrates that hereditary AEFSs are more expressive 
than hereditary EFSs. 

Theorem 12 

Proof: Consider the language family $\mathcal{C} = (L_i)_{i \in \mathbb{N}}$ such that $L_0 = \{a^n \mid n \in \mathbb{N}\}$
and $L_{i+1} = \{a^n \mid n \leq i + 1\}$. Having a closer look at the demonstration of
Theorem 10, one directly sees that $\mathcal{C} \in \mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(2, 1, 1, 1))$.

We claim that $\mathcal{C}$ witnesses the stated separation. Suppose to the
contrary that there are $m, k, t, r \in \mathbb{N}$ such that $\mathcal{C} \in \mathcal{L}(\mathit{h}\text{-}\mathcal{EFS}(m, k, t, r))$. Since
$\mathcal{C} \notin \mathit{LimTxt}$ (cf. [26], e.g.), this directly implies $\mathcal{L}(\mathit{h}\text{-}\mathcal{EFS}(m, k, t, r)) \notin \mathit{LimTxt}$.
However, by combining results from [18] and [14], it can easily be shown that
$\mathcal{L}(\mathit{h}\text{-}\mathcal{EFS}(m, k, t, r)) \in \mathit{LimTxt}$, a contradiction. The relevant details are as
follows: It has been shown that, for every $m, k, t, r \in \mathbb{N}$, $\mathcal{L}(\mathit{h}\text{-}\mathcal{EFS}(m, k, t, r))$ has
polynomial dimension (cf. [14]; see also Lemma 4 in the demonstration of
Theorem 13 below). Moreover, every EFS definable language class with polynomial
dimension has bounded finite thickness which in turn implies that this language
class is $\mathit{LimTxt}$-identifiable (cf. [18]).* □
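
The family used in the proofs of Theorems 10 and 12 can be checked mechanically on a bounded universe. The Python sketch below is an illustration under assumptions made here: words are truncated at a maximal length, the rules of $S_0$ ($p(a)$, $p(ax) \leftarrow p(x)$) and of $S_i$ ($q(a^i x)$, $p(x) \leftarrow$ not $q(x)$) are evaluated directly rather than through a general AEFS interpreter, and the function names are hypothetical.

```python
def language_s0(max_len):
    """Words of L(S_0, p) up to max_len, derived from p(a) and p(ax) <- p(x)."""
    derived = {"a"}
    changed = True
    while changed:
        changed = False
        for x in list(derived):
            w = "a" + x
            if len(w) <= max_len and w not in derived:
                derived.add(w)
                changed = True
    return derived


def language_si(i, max_len):
    """Words of L(S_i, p) up to max_len for i >= 1.

    q(a^i x) holds for words a^i x with a non-empty x (substitutions are
    non-erasing); p(x) <- not q(x) then selects the remaining words.
    """
    universe = {"a" * n for n in range(1, max_len + 1)}
    q = {w for w in universe if len(w) > i and w.startswith("a" * i)}
    return universe - q


bound = 6
l0 = language_s0(bound)
chain = [language_si(i, bound) for i in range(1, bound + 1)]
```

On this bounded universe, $L_i$ consists of the words $a^n$ with $n \leq i$, the languages $L_1 \subset L_2 \subset \cdots$ form an ascending chain of finite languages, and all of them are contained in the infinite language $L_0$. This is the classical configuration that prevents learning in the limit from positive data.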

Surprisingly, Theorem 11 remains valid in case that one considers hereditary
AEFSs instead of EFSs. This nicely contrasts the fact that, in Gold’s [9] model,
AEFS definable language classes may become harder to learn than EFS
definable ones, although they are supposed to meet the same syntactical constraints
(cf. Theorems 9 and 10). Moreover, having Theorem 12 in mind, the next
theorem establishes the polynomial-time PAC learnability of a language class that
properly comprises the class in [14].

Theorem 13 Let $m, k, t, r \in \mathbb{N}$. Then, the class $\mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r))$ is
polynomial-time PAC learnable.

Proof: Let $m, k, t, r \in \mathbb{N}$. Furthermore, let $\mathcal{C} = \mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r))$ and let
$R$ be a mapping that assigns AEFSs in $\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r)$ to languages in $\mathcal{C}$.
Applying results from [7] and [15], it suffices to show:

(1) $\mathcal{C}$ is of polynomial dimension.

(2) There is a polynomial-time finder for $R$, i.e., there exists a polynomial-time
algorithm that, given a finite set $T$ of examples for any $L \in \mathcal{C}$, computes an
AEFS $S \in \mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r)$ that is consistent with $T$.

Due to space constraints, only a formal verification of (1) is provided. Note
that (2) can be shown by adapting ideas used in [14] to demonstrate Theorem 11.
The differences rest on the fact that the entailment relation for AEFSs does not
meet the monotonicity principle of classical logics.

Lemma 3 Let $T$ be a set of examples over $\Sigma$. Furthermore, let $(S, p)$ be a pair
consisting of a hereditary AEFS $S = (\Sigma, \Pi, \Gamma)$ and a unary predicate symbol
$p \in \Pi$. If $(S, p)$ is reduced with respect to $T$, then, for each rule $q_0(\pi_1^0, \ldots, \pi_{r_0}^0) \leftarrow
q_1(\pi_1^1, \ldots, \pi_{r_1}^1), \ldots, q_{t'}(\pi_1^{t'}, \ldots, \pi_{r_{t'}}^{t'})$ in $\Gamma$, there exists a substitution $\sigma$ such that
all the $\pi_j^i\sigma$ are subwords of some labelled word from $T$.

* Note that, for AEFS definable language classes, an analogous implication does not
hold. This is caused by the fact that the entailment relation for AEFSs does not
meet the monotonicity principle of classical logics.

Proof: Assume the contrary. Let $T$ be a set of examples over $\Sigma$, let $(S, p)$ be
a pair consisting of a hereditary AEFS $S = (\Sigma, \Pi, \Gamma)$ and a unary predicate
symbol $p$ in $\Pi$ such that $(S, p)$ is reduced with respect to $T$. Moreover, let
$r = q_0(\pi_1^0, \ldots, \pi_{r_0}^0) \leftarrow q_1(\pi_1^1, \ldots, \pi_{r_1}^1), \ldots, q_{t'}(\pi_1^{t'}, \ldots, \pi_{r_{t'}}^{t'})$ be a rule in $\Gamma$ that
violates the assertions stated in Lemma 3.

We claim that $L(S', p)$ with $S' = (\Sigma, \Pi, \Gamma')$ is also consistent with $T$, where
$\Gamma' = \Gamma \setminus \{r\}$. To see this, assume the contrary.

Case 1: There is a word $w$ such that $(w, +) \in T$ and $w \notin L(S', p)$.

Hence, during the derivation* of $p(w)$, a ground instance $r\sigma$ of rule $r$ has to be
used. Since $S$ is hereditary, each of $\pi_1^0\sigma, \ldots, \pi_{r_0}^0\sigma$ is a subword of $w$. Consequently,
this implies that all $\pi_j^i\sigma$ are subwords of $w$, contradicting our assumption.

Case 2: There is a word $w$ such that $(w, -) \in T$ and $w \in L(S', p)$.

Hence, there must be an atom $p'(w_1, \ldots, w_{r'})$ that is used when deriving
$p(w)$ such that (i) $p'(w_1, \ldots, w_{r'}) \in Sem(S')$, (ii) $p'(w_1, \ldots, w_{r'}) \notin Sem(S)$,
and (iii) there is a rule $p'(x_1, \ldots, x_{r'}) \leftarrow$ not $q'(x_1, \ldots, x_{r'})$ in $\Gamma'$ such that
$q'(w_1, \ldots, w_{r'}) \in Sem(S)$ and $q'(w_1, \ldots, w_{r'}) \notin Sem(S')$. Since $S$ is
hereditary, all $w_1, \ldots, w_{r'}$ are subwords of $w$. Now, analogously to Case 1, during the
derivation of $q'(w_1, \ldots, w_{r'})$ according to the rules in $S$, a ground instance $r\sigma$ of
rule $r$ has to be used. As argued above, all the $\pi_j^i\sigma$ are subwords of the words
$w_1, \ldots, w_{r'}$, and therefore they are subwords of $w$, too. Since $(w, -) \in T$, this
contradicts our assumption.

Summing up, $L(S', p)$ must be consistent with $T$. Hence, $S$ is not reduced
with respect to $T$, a contradiction, and thus Lemma 3 follows. ○

Lemma 4 For any $m, k, t, r \in \mathbb{N}$, the class $\mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r))$ has
polynomial dimension.

Proof: Let $m, k, t, r \in \mathbb{N}$ be fixed. We estimate the cardinality of the class
$\mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r))^n$ in dependence on $n$.

Let $(S, p)$ be a pair of a hereditary AEFS $S = (\Sigma, \Pi, \Gamma) \in
\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r)$ and a predicate symbol $p \in \Pi$ of arity one. Since $\Gamma$ contains
at most $m$ rules, we may assume that $|\Pi| \leq m$. Furthermore, we may assume that
$(S, p)$ is reduced with respect to some finite set of examples $T \subseteq \Sigma^n \times \{+, -\}$.

By definition, each rule in $\Gamma$ is either of form (i) $A \leftarrow B_1, \ldots, B_{t'}$ or of form
(ii) $A' \leftarrow$ not $B_1'$, where $A' = p'(x_1, \ldots, x_j)$ and $B_1' = q'(x_1, \ldots, x_j)$ for some
$p', q' \in \Pi$ and variables $x_1, \ldots, x_j$. Because of Lemma 3, the same counting
arguments as in [14] can be invoked to show that there are at most 0(2”^^) rules of
form (i). Moreover, as a simple calculation shows, there are 0((2mr'’)^) rules of
form (ii) (which does not depend on $n$). Consequently, there are at most 0(2" )
rules that can be used when defining an AEFS in $\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r)$, and thus
there are at most 0(2" ) hereditary AEFSs with at most $m$ rules that have to be
considered when estimating the cardinality of the class $\mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r))^n$.

Hence, the class $\mathcal{L}(\mathit{h}\text{-}\mathcal{AEFS}(m, k, t, r))$ has polynomial dimension, and thus
Lemma 4 follows. ○

* We abstain from formally defining the term derivation, since an intuitive
understanding shall suffice. For the missing details, the interested reader is referred to [6],
for instance.

Hence, (1) is indeed fulfilled. □ 

4 Conclusions 

Motivated by research related to knowledge discovery and information extraction 
in the World Wide Web, we introduced advanced elementary formal systems 
(AEFSs) - a kind of logic programs to manipulate strings. 

The authors are currently applying the approach presented here within a joint 
research and development project named LExIKON on information extraction 
from the Internet. This project is supported by the German Federal Ministry for 
Economics and Technology. 

Advanced elementary formal systems generalize elementary formal systems 
(EFSs) in that they allow for the use of a certain kind of negation, which is non- 
monotonic, in essence, and which is conceptually close to negation as failure. In 
our approach, we syntactically constrained the use of negation. This guarantees 
that AEFSs inherit some of the convenient properties of EFSs - for instance, 
their clear and easy to capture semantics. 

Negation as failure allows one to describe formal languages in a more natural
and compact manner. Moreover, as Theorems 7 and 8 show, AEFSs are more
expressive than EFSs. Naturally, this leads to the question of whether or not
the known learnability results for EFS definable language classes remain valid if
one considers the more general framework of AEFSs. Interestingly, the answer
to this question heavily depends on the underlying learning paradigm.

As we have shown, certain AEFS definable language classes are not Gold-style
learnable from only positive data, although the corresponding language
classes that are definable by EFSs are known to be learnable (cf. Theorem 10).
Surprisingly, in the PAC model, differences of this type cannot be observed
(cf. Theorems 11 and 13). Although the considered classes of AEFS definable
languages properly comprise the corresponding classes of EFS definable languages
- which are the largest classes of EFS definable languages formerly known to
be polynomial-time PAC learnable - both language classes are polynomial-time
PAC learnable.

Abstract. Residual languages are important and natural components
of regular languages. Most approaches in grammatical inference rely on
this notion. Classical algorithms such as RPNI try to identify prefixes
of positive learning examples which give rise to identical residual
languages. Here, we study inclusion relations between residual languages.
We conduct experiments which show that when regular languages are
randomly drawn using non deterministic representations, the number of
inclusion relations is very large. We introduced in previous articles a
new class of automata which is defined using the notion of residual
languages: residual finite state automata (RFSA). RFSA representations of
regular languages may have far fewer states than DFA representations. We
prove that RFSA are not polynomially characterizable. However, we
design a new learning algorithm, DeLeTe2, based on the search for inclusion
relations between residual languages, which produces an RFSA and has
both good theoretical properties and good experimental performance.



1 Introduction 

The subject of this paper is grammatical inference of regular languages. Most
classical approaches in this field represent regular languages by Deterministic
Finite State Automata (DFA): operations on DFA are fast, and every regular
language admits a unique minimal DFA. However, it is well known that DFA are
not always an efficient way of representing regular languages: the size of the minimal
DFA of languages as simple as $\Sigma^* 0 \Sigma^n$ is exponential with respect to $n$; thus
these languages cannot be learned by classical algorithms in reasonable time. It
seems natural to learn regular languages using non deterministic representations
[CF00], [DLT00]. We presented in [DLT01] a new class of non deterministic finite
automata, the class of Residual Finite State Automata (RFSA). The residual
language of a language $L$ with regard to a word $u$ is the set of words $v$ such that
$uv$ is in $L$. The number of distinct residual languages of a regular language is
finite (Myhill-Nerode theorem). The definition of RFSA is based on this notion.
RFSA have some interesting properties: for example, every regular language can
be represented by a unique minimal canonical RFSA which can be exponentially
smaller than its equivalent minimal DFA. A first learning algorithm using this
representation has been described in [DLT00].

* This work was partially supported by the “projet TIC du CPER TACT - région
Nord - Pas de Calais”

We first study the notion of residual language. A residual language of a
regular language is said to be prime if it is not a union of other residual languages.
Each state of the canonical RFSA of a language L is associated with one prime 
residual language of L. We first study the ratio between the number of prime 
residual languages of L and the total number of residual languages of L. The 
second aspect we study is the number of inclusion relations between distinct 
residual languages of L, as this notion is used to build RFSA and can be
exploited in learning algorithms. In Section 3, we study these parameters from
an experimental point of view, and show they highly depend on the way
ular languages are generated. If languages are generated using random DFA, 
most residual languages are prime, and there are few inclusion relations between 
them. But if regular languages are generated using NFA or regular expressions, 
the number of prime residual languages is often small with regard to the total 
number of residual languages, and there are a lot of inclusions between them. 

These results suggest two families of learning algorithms that we study in
Section 4: the first approach would be to seek prime residual languages: we
show in Section 4.1 that this identification is impossible using a sample
polynomial with respect to the size of the canonical RFSA. The second approach
would be to look for inclusion relations between residual languages: this
approach is developed in Section 4.2 where we introduce a new learning algorithm
(DeLeTe2). Whereas classical algorithms, like RPNI [OG92], look for equivalent
residual languages and merge states to obtain a DFA, DeLeTe2 looks for
inclusions of residual languages and uses them to add transitions to the
current automaton, and obtain an RFSA. Section 5 presents experimental results
of DeLeTe2, and shows that, when regular languages are generated using non
deterministic representations, DeLeTe2 performs well.



2 Preliminaries 

The reader may refer to [Yu97], [HU79] for classical definitions and proofs on
formal language theory. The notions of prime and composed residual languages
and RFSA have been introduced and studied in [DLT01].



2.1 Regular Languages, Regular Expressions, and Automata 

Let $\Sigma$ be a finite alphabet and let $\Sigma^*$ be the set of words on
$\Sigma$. We denote by $\varepsilon$ the empty word and by $|u|$ the length of a
word $u$. A language is a subset of $\Sigma^*$. For any word $u$ and any
language $L$, we write $\mathrm{Pref}(u) = \{v \mid \exists w,\ vw = u\}$ and
$\mathrm{Pref}(L) = \bigcup\{\mathrm{Pref}(u) \mid u \in L\}$. We denote by $<$
the usual lexicographic order^1. A non-deterministic finite automaton (NFA) is
a quintuple $A = (\Sigma, Q, Q_0, F, \delta)$ where $Q$ is a finite set of
states, $Q_0 \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the
set of final states, and $\delta$ is the transition function defined from a
subset of $Q \times \Sigma$ to $2^Q$. As usual, we also denote by $\delta$ the
extended transition function defined from a subset of $2^Q \times \Sigma^*$ to
$2^Q$. We take the number of states of a NFA as a measure of its size. A NFA is
deterministic (DFA) if $Q_0$ contains exactly one element $q_0$ and if
$\forall q \in Q, \forall x \in \Sigma$, $\mathrm{Card}(\delta(q, x)) \leq 1$.
A DFA is complete if $\forall q \in Q, \forall x \in \Sigma$,
$\mathrm{Card}(\delta(q, x)) = 1$. A NFA is trimmed if $\forall q \in Q$,
$\exists w_1, w_2 \in \Sigma^*$, $q \in \delta(Q_0, w_1)$ and
$\delta(q, w_2) \cap F \neq \emptyset$. A word $u \in \Sigma^*$ is recognized
by a NFA if $\delta(Q_0, u) \cap F \neq \emptyset$, and the language $L_A$
recognized by $A$ is the set of the words recognized by $A$. We denote by
$\mathrm{Rec}(\Sigma^*)$ the class of recognizable languages over $\Sigma^*$.
There exists a unique minimal DFA that recognizes a given recognizable language
(minimal with regard to the number of states and unique up to an isomorphism).
A regular expression $e$ denotes a regular language $L(e)$ if
$e = \emptyset$ and $L(e) = \emptyset$; $e = \varepsilon$ and
$L(e) = \{\varepsilon\}$; $e = x$ and $L(e) = \{x\}$ where $x \in \Sigma$;
$e = e_1 + e_2$ and $L(e) = L(e_1) \cup L(e_2)$; $e = e_1 \cdot e_2$ and
$L(e) = L(e_1)L(e_2)$; $e = e_1^*$ and $L(e) = (L(e_1))^*$. The Kleene theorem
proves that the class of regular languages $\mathrm{Reg}(\Sigma^*)$ is
identical to $\mathrm{Rec}(\Sigma^*)$.

^1 For example, the first words defined on $\Sigma = \{a, b\}$ are
$\varepsilon, a, b, aa, ab, ba, bb, \ldots$

2.2 Residual Languages and RFSA 

For any language $L$ and any word $u$ over $\Sigma$, we denote by
$u^{-1}L = \{v \in \Sigma^* \mid uv \in L\}$ the residual language of $L$
associated with $u$ (also called Brzozowski derivative [Brz64]). The set of
distinct residual languages of any regular language is finite (Myhill-Nerode
theorem). A residual language is composed if it is equal to the union of the
residual languages it strictly contains, i.e. $u^{-1}L$ is composed if and only
if $u^{-1}L = \bigcup\{v^{-1}L \mid v^{-1}L \subsetneq u^{-1}L\}$. A residual
language is prime if it is not composed. Let $A = (\Sigma, Q, Q_0, F, \delta)$
be a NFA and let $q \in Q$. We denote by $L_q$ the language defined by
$L_q = \{v \mid \delta(q, v) \cap F \neq \emptyset\}$. If $A$ is a trimmed DFA,
$L_q$ is always a residual language of $L_A$; moreover, if $A$ is minimal, for
every non-empty residual language $u^{-1}L$, there exists a unique $q \in Q$
such that $L_q = u^{-1}L$. A Residual Finite State Automaton (RFSA) is a NFA
$A = (\Sigma, Q, Q_0, F, \delta)$ such that, for each state $q \in Q$, $L_q$ is
a residual language of $L_A$. Trimmed DFA are RFSA. It can be proved that for
each prime residual $u^{-1}L_A$ of a RFSA $A$, there exists a state $q$ such
that $L_q = u^{-1}L_A$. A state $q$ of a RFSA is said to be prime (resp.
composed) if the residual language it defines is prime (resp. composed).
The saturated of a RFSA $A$ is $A^s = (\Sigma, Q, Q_0^s, F, \delta^s)$ with
$Q_0^s = \{q \in Q \mid L_q \subseteq L_A\}$ and
$\forall q \in Q, \forall x \in \Sigma$,
$\delta^s(q, x) = \{q' \in Q \mid L_{q'} \subseteq x^{-1}L_q\}$. It can be shown
that the saturated of a RFSA is a RFSA. A saturated RFSA can be reduced by
deleting its composed states. We then obtain a unique minimal (with regard
to the number of states) trimmed saturated RFSA which recognizes the same
language and which is called the canonical RFSA of $L$. The number of states of
a canonical RFSA is exactly the number of non-empty prime residual languages
of the language it recognizes. Therefore, a canonical RFSA can be much smaller
than its equivalent DFA: for example, the canonical RFSA recognizing
$\Sigma^* 0 \Sigma^n$ has $n + 2$ states while its equivalent minimal DFA
possesses $2^{n+1}$ states. Let $A = (\Sigma, Q, Q_0, F, \delta)$ be a canonical
RFSA. The simplified canonical RFSA of $L_A$ is defined by
$A' = (\Sigma, Q, Q_0', F, \delta')$ with
$Q_0' = \{q \in Q_0 \mid \nexists q' \in Q_0,\ L_q \subsetneq L_{q'}\}$
and $\forall q' \in Q, \forall x \in \Sigma$,
$\delta'(q', x) = \{q \in \delta(q', x) \mid \nexists q'' \in \delta(q', x),\ L_q \subsetneq L_{q''}\}$.
This automaton recognizes $L_A$. All references about prime residual languages
and RFSA can be found in [DLT01].

Example 1. Let $A$ be the minimal DFA described in Fig. 1. The residual
languages are $L_{q_0} = \varepsilon + 00^+ + 0^*1\Sigma^+$,
$L_{q_1} = 0^+ + 0^*1\Sigma^+$, $L_{q_2} = \Sigma^+$,
$L_{q_3} = 0^* + 0^*1\Sigma^+$, $L_{q_4} = \Sigma^*$. We have the following
inclusions and compositions: $L_{q_1} \subset L_{q_2}$,
$L_{q_3} = L_{q_0} \cup L_{q_1}$,
$L_{q_4} = L_{q_0} \cup L_{q_1} \cup L_{q_2} \cup L_{q_3}$. The prime
states are $q_0, q_1, q_2$. The saturated of $A$ is described in Fig. 1 and the
equivalent canonical RFSA and simplified canonical RFSA are described in Fig. 2.
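The claims of Example 1 can be checked mechanically. The sketch below is only an illustration: the transition table is an assumption, reconstructed from the five residual languages listed in the example because the figure itself is not reproduced in this text, and the helper names `included` and `is_prime` are hypothetical. Inclusion of state languages is tested by exploring reachable pairs of states, and primality by comparing a state with the set of states whose languages it strictly contains.

```python
# Minimal DFA of Example 1, reconstructed from the listed residual languages
# (assumption: the original figure is not available in this text).
DELTA = {
    (0, "0"): 1, (0, "1"): 2,
    (1, "0"): 3, (1, "1"): 2,
    (2, "0"): 4, (2, "1"): 4,
    (3, "0"): 3, (3, "1"): 2,
    (4, "0"): 4, (4, "1"): 4,
}
FINAL = {0, 3, 4}
STATES = range(5)
ALPHABET = "01"


def included(p, q):
    """True iff L_p is a subset of L_q (search over reachable state pairs)."""
    seen, stack = {(p, q)}, [(p, q)]
    while stack:
        a, b = stack.pop()
        if a in FINAL and b not in FINAL:
            return False  # some word is accepted from p but not from q
        for x in ALPHABET:
            nxt = (DELTA[a, x], DELTA[b, x])
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return True


def is_prime(q):
    """True iff L_q differs from the union of the state languages it strictly contains."""
    # In a minimal DFA distinct states have distinct languages,
    # so "r != q and L_r included in L_q" means strict inclusion.
    below = frozenset(r for r in STATES if r != q and included(r, q))
    seen, stack = {(q, below)}, [(q, below)]
    while stack:
        a, group = stack.pop()
        if a in FINAL and not (group & FINAL):
            return True  # a word of L_q missed by every smaller residual
        for x in ALPHABET:
            nxt = (DELTA[a, x], frozenset(DELTA[r, x] for r in group))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


prime_states = [q for q in STATES if is_prime(q)]
inclusions = [(p, q) for p in STATES for q in STATES if p != q and included(p, q)]
print(prime_states, len(inclusions))
```

On this reconstructed automaton the prime states come out as q0, q1 and q2, in agreement with the example; the two quantities computed here are the ones studied experimentally in Section 3.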




Fig. 1. The minimal DFA recognizing $\varepsilon + 00^+ + 0^*1\Sigma^+$ and its saturated RFSA.




Fig. 2. The equivalent canonical RFSA and simplified canonical RFSA. 
2.3 The Model of Learning from Given Data



The learning model from given data has been introduced in [Gol78] where Gold
proved that regular languages represented by DFA are polynomially learnable
from given data (see also [Hig97]). An example of a language $L$ over
$\Sigma^*$ is a pair $(u, e)$ where $e = 1$ if $u \in L$ and $e = 0$ otherwise.
A sample of $L$ is a finite set of examples of $L$. The size of a sample is the
sum of the lengths of the words it contains. For any sample $S$ of $L$, we
write $S^+ = \{u \mid (u, 1) \in S\}$ and $S^- = \{u \mid (u, 0) \in S\}$.

Here, we only consider the class of regular languages REG. We consider three 
representation schemes: DFA, RFSA and NFA. 



Definition 1. [Hig97] We say that REG is semi-polynomially learnable from
given data using representation scheme $R$ if there exist two algorithms $T$
and $C$ such that for any target language $L \in$ REG and any representation
$r \in R(L)$:

- $T$ with input $r$ computes a teaching sample $S_L$ whose size is polynomial
  in the size of $r$,

- for any sample $S$ of $L$ containing $S_L$, $C$ with input $S$ computes a
  representation of $L$ in time polynomial in the size of $S$.

We say that REG is polynomially learnable from given data if it is
semi-polynomially learnable and if $T$ computes $S_L$ in time polynomial in $|r|$.

REG is polynomially learnable from given data using the representation scheme
by DFA, and RPNI is a learning algorithm in this framework [OG92]. De la
Higuera gives a necessary condition to be semi-polynomially learnable based on
the following notion:

Definition 2. We say that REG is polynomially characterizable using the
representation scheme $R$ if there exists a function $T$ such that

- for any language $L \in$ REG and any representation $r \in R(L)$, $T(r)$ is
  a sample of $L$ whose size is polynomial in the size of $r$,

- for any pair of distinct languages $(L, L')$ represented by $(r, r')$, $L$ is
  not consistent with $T(r')$ or $L'$ is not consistent with $T(r)$.



Proposition 1. [Hig97] If REG is semi-polynomially learnable from given data
using a representation $R$ then it is polynomially characterizable using $R$.
For any non-empty alphabet, REG is not polynomially characterizable using
representations by NFA.



3 Experimental Study of Residual Languages 

As RFSA are defined using the notion of residual language, a natural first step
before building learning algorithms based on this representation is to study
properties of residual languages. We focus our study on two aspects of residual
languages that are important for RFSA. The first aspect is the number
of prime residual languages of a regular language, which is also the number of
states of the canonical RFSA, and the second aspect is the number of inclusion
relations between residual languages, as this notion is used in the definition
of the transition function of RFSA. We study these questions from an
experimental point of view. We consider here three classical representation
schemes of regular languages: DFA, NFA and regular expressions, and we define
natural ways to randomly draw regular languages using these representations. We
observed that results differ strongly depending on the generation method
used: when languages are generated using DFA, there are few inclusion
relations between residual languages, and few composed residual languages.
But when languages are generated using NFA or regular expressions, there are
many inclusion relations between residual languages, and the number of prime
residual languages can be very small with regard to the total number of residual
languages. The protocol of our experiments is defined below.

The procedure DrawDFA(nStates, $p_F$) takes as input an integer nStates
and a probability $p_F$ and outputs a complete DFA whose number of states is
chosen randomly between 1 and nStates. The successor of any state reading
any letter is chosen randomly among the set of states and each state is chosen
to be final with a probability equal to $p_F$.

The procedure DrawNFA(nStates, nTrans, $p_I$, $p_F$) takes as input two
integers nStates and nTrans and two probabilities $p_I$ and $p_F$, and outputs
a NFA with nStates states. For any state and any letter, the number of possible
transitions is chosen randomly between 0 and nTrans. Each state is chosen to be
initial (resp. final) with a probability equal to $p_I$ (resp. $p_F$).

The procedure DrawRegExp(NbOp, $p_\emptyset$, $p_0$, $p_1$, $p_*$, $p_\cdot$,
$p_+$) takes as input an integer NbOp and 6 non-negative numbers summing to 1
and outputs a regular expression which has at most NbOp operators; the root
operator is chosen among $\{\emptyset, 0, 1, +, \cdot, *\}$ using the input
probabilities. When the root operator is unary, the procedure is called with
parameter NbOp-1 to build its argument, and when it is binary, it is called
with approximately (NbOp-1)/2 on each branch.
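The first two drawing procedures can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `draw_dfa` and `draw_nfa` are hypothetical, the alphabet is fixed to two letters, state 0 is taken as the initial state of the DFA, and the targets of NFA transitions are sampled uniformly without replacement; the text above does not specify these last two points.

```python
import random


def draw_dfa(n_states_max, p_final, alphabet="01", rng=random):
    """Draw a complete DFA; state 0 is taken as initial (an assumption)."""
    n = rng.randint(1, n_states_max)  # number of states between 1 and n_states_max
    delta = {(q, x): rng.randrange(n) for q in range(n) for x in alphabet}
    final = {q for q in range(n) if rng.random() < p_final}
    return n, 0, final, delta


def draw_nfa(n_states, n_trans, p_init, p_final, alphabet="01", rng=random):
    """Draw an NFA; targets are sampled without replacement (an assumption)."""
    delta = {}
    for q in range(n_states):
        for x in alphabet:
            # between 0 and n_trans outgoing transitions for this (state, letter)
            k = min(rng.randint(0, n_trans), n_states)
            delta[(q, x)] = set(rng.sample(range(n_states), k))
    initial = {q for q in range(n_states) if rng.random() < p_init}
    final = {q for q in range(n_states) if rng.random() < p_final}
    return initial, final, delta


rng = random.Random(0)  # fixed seed so that a run can be repeated
dfa = draw_dfa(120, 0.1, rng=rng)
nfa = draw_nfa(30, 2, 0.1, 0.1, rng=rng)
```

Passing an explicit `random.Random` instance keeps the draws repeatable, which is convenient when the same languages must be reused across several measurements.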

The main procedure Result uses one of the above procedures to draw regular
languages and fills two arrays T1[100 x 100] and T2[100 x 100]: T1[n, i] is
the number of prime residual languages of the i-th regular language drawn whose
minimal DFA has a size equal to n (as a consequence, n is also the total number
of residual languages of the language). T2[n, i] is the number of inclusion
relations between two distinct residual languages of this language. The
procedure stops when the arrays are filled.

Fig. 3 and 4 show curves corresponding to the procedure Result when regular
languages are drawn using DrawDFA(120, 0.1), DrawRegExp(100, 0.025,
0.05, 0.05, 0.125, 0.5, 0.25) and DrawNFA(30, 2, 0.1, 0.1). Other
experiments have been performed with other values and similar results have
always been obtained. These results indicate that the ratio between the number
of prime residual languages and the total number of residual languages of a
language, and therefore between the size of a canonical RFSA and a minimal DFA,
is highly dependent on the method used to draw regular languages: drawing DFA
provides regular languages such that almost all residual languages are prime,
while drawing NFA or regular expressions provides regular languages such that
most residual languages are composed, which also implies that there are many
inclusion relations between residual languages. This can be checked on Fig. 4,
which also indicates that there is nearly no such inclusion relation in
languages generated by DFA (the curve nearly merges with the x-axis).

Fig. 3. Number of prime residual languages.

Fig. 4. Inclusions between distinct residual languages.

Fig. 5. Number of transitions in the canonical RFSA.

Fig. 6. Transitions in the simplified canonical RFSA.

Note that we use here the number of states of an automaton as a measure
of its size. Its number of transitions can also be considered. Fig. 5 and 6 show
curves corresponding to the number of transitions for canonical RFSA and
simplified canonical RFSA for the same languages as above. Languages obtained
using DFA have a number of transitions roughly equal to the number of
transitions of the minimal DFA (languages studied here have 2 letters, therefore
their complete minimal DFA have 2 x n transitions). For other languages,
simplified canonical RFSA usually have significantly fewer transitions than
minimal DFA. Therefore it is reasonable to say that the simplified canonical
RFSA is a more economical way than DFA to represent regular languages generated
by non-deterministic representations.

These results have yet to be explained and, for now, we can only speculate on
their explanation. Our hypothesis is that generating a DFA with n states is
roughly equivalent to generating n distinct languages (one per state), and these
languages have a low probability of having any inclusion relation or of being
composed. When a NFA with n states is generated, the n languages obtained also
have a low probability of having inclusion relations, but the languages
associated with states of the corresponding minimal DFA are compositions
(unions) of these languages, which may be the reason why so many states are
composed or have inclusion relations between them. A similar argument can
probably be used for regular expressions.

These experiments suggest two approaches in grammatical inference. The
first one consists in identifying prime residual languages of target languages
(we show in Section 4.1 that this identification is not possible with a sample
of size polynomial in the size of the canonical RFSA). The second idea would be
to seek inclusion relations between residual languages: this is the approach
that we develop in Section 4.2.

These results also raise the problem of a natural representation of regular
languages: none of the previously described procedures is more natural or more
artificial than the others. As a consequence, artificial learning benchmarks
should not only be based on procedures that draw DFA. We also raise the naive
question: do the regular languages occurring in practical cases belong to the
first class (size(canonical RFSA) $\approx$ size(minimal DFA)) or to the second
class (size(canonical RFSA) $\ll$ size(minimal DFA))?

4 Learning Using Residual Languages 

In this section, we present two ways to use properties of residual languages in
grammatical inference. The first approach that seems natural with regard to
the previous results would be to use the fact that prime residual languages of a
regular language can be few among the set of all its residual languages. From
this point of view, the canonical RFSA seems to be an interesting target for
grammatical inference. However, we prove that the class of regular languages is
not polynomially characterizable using RFSA (if the underlying alphabet has at
least two letters).

The other interesting property of residual languages that the previous
experiments point out is the fact that there can be many inclusion relations
between them. We present a new learning algorithm based on inclusion detection
between residual languages and study its theoretical properties. These results
refine the study made in [DLT00].



4.1 RFSA Are Not Polynomially Characterizable 

It has been shown in [Hig97] that the class REG is not polynomially
characterizable using NFA even if the underlying alphabet contains only one
letter. The proof cannot be directly used to show an analogous result for RFSA,
as it can be proved that the class of RFSA over a one-letter alphabet is
polynomially learnable from given data. Indeed, it can be shown that the size of
the canonical RFSA of a regular language over a one-letter alphabet is not
smaller than the square root of the size of the equivalent minimal DFA, and
algorithms such as RPNI can be used to learn them. The number of states of an
automaton is used as a measure of its size, but the results presented below
remain true if we use the number of transitions (as the number of transitions is
less than or equal to $|Q| \times |Q| \times |\Sigma|$).

Proposition 2. The class of regular languages over a two-letter alphabet is not
polynomially characterizable using RFSA.

Proof. Let $p_1, \ldots, p_k$ be distinct prime numbers and let
$L_{p_i} = a^* \setminus (a^{p_i})^* = \{a^{n p_i + m} \mid n \geq 0,\ 0 < m < p_i\}$
for each $i$, $1 \leq i \leq k$. Let us consider the two
canonical RFSA $A_1$ and $A_2$ recognizing the languages

$L_{A_1} = aa^+ \cup \bigcup_{i=1}^k b^i L_{p_i}$ and $L_{A_2} = \bigcup_{i=1}^k (a + b^i) L_{p_i}$.

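Automaton $A_1$ has $\sum_{i=1}^k p_i + k + 1$ states and automaton $A_2$ has
$\sum_{i=1}^k p_i + k$ states. For any polynomial $P$, we can choose prime
numbers such that $\prod_{i=1}^k p_i$ is bigger than
$P(\sum_{i=1}^k p_i + k + 1)$. We verify that:

$L_{A_1} \cap a^* = aa^+$

$L_{A_2} \cap a^* = a(\bigcup_{i=1}^k L_{p_i}) = a(\bigcup_{i=1}^k (a^* \setminus (a^{p_i})^*)) = a(a^* \setminus (a^{\prod_{i=1}^k p_i})^*)$

$L_{A_1} \cap b(a + b)^* = L_{A_2} \cap b(a + b)^*$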

La, n (a + = La^ n (a + 

Therefore any set $S$ with a size smaller than $\prod_{i=1}^k p_i$ verifies
$S \setminus L_{A_1} = S \setminus L_{A_2}$ and $S \cap L_{A_1} = S \cap L_{A_2}$.

Suppose now that REG is polynomially characterizable using RFSA. Let $T$ be
the function that computes characteristic samples and let $P$ be a polynomial
such that $size(T(A)) \leq P(size(A))$. Let $k$ be such that
$\prod_{i=1}^k p_i > P(\sum_{i=1}^k p_i + k + 1)$
and let $S_1 = T(A_1)$ and $S_2 = T(A_2)$. $L_{A_1}$ is consistent with $S_2$
and $L_{A_2}$ is consistent with $S_1$, which is contradictory. □
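The gap exploited by this proof can be made concrete with a few lines of arithmetic. The sketch below is only an illustration of the two quantities compared in the proof, using the first k primes as one possible choice of $p_1, \ldots, p_k$; the helper names `first_primes` and `sizes` are hypothetical.

```python
from math import prod


def first_primes(k):
    """Return the first k prime numbers by trial division."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def sizes(k):
    """Number of states of A_1 and the product of the primes, for the first k primes."""
    ps = first_primes(k)
    states_a1 = sum(ps) + k + 1  # size of the canonical RFSA A_1 in the proof
    product = prod(ps)           # samples smaller than this cannot separate A_1 and A_2
    return states_a1, product


for k in (3, 5, 10):
    print(k, *sizes(k))
```

With the first ten primes, the automaton has 140 states while the product already exceeds six billion, which shows how quickly the product outgrows any fixed polynomial of the number of states.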

With Proposition 1, we obtain the following corollary:

Corollary 1. The class of regular languages over a two-letters alphabet is not 
semi-polynomially learnable using RFSA. 

Although regular languages are polynomially learnable when represented by
DFA, they are not polynomially learnable when represented by RFSA. In other
words, if $n_r(L)$ is the number of residual languages of a language $L$ and if
$n_p(L)$ is the number of its prime residual languages, a number of examples
polynomial in $n_r(L)$ is sufficient to learn $L$ whereas a number of examples
polynomial in $n_p(L)$ is not sufficient. As our experiments showed that
$n_p(L) \ll n_r(L)$ when regular languages are generated using
non-deterministic representations, it would be interesting to know if it is
possible to find an intermediary value of the learning parameter.
If we denote by $p(L)$ the greatest depth of a prime residual language of $L$
(i.e. $p(L) = \max\{\min\{|u| \mid u^{-1}L = R\} \mid R \text{ prime residual language of } L\}$),
we can hope that there exists a learning algorithm polynomial in
$p(L) \times n_p(L)$ (which can still be small with regard to $n_r(L)$). But
this problem, or the more general problem of knowing whether there exists
another intermediary solution, is still open.



4.2 Inclusion Relation Based Learning Algorithm 

Classical learning algorithms such as RPNI build a prefix tree acceptor from
the positive examples, and evaluate whether languages associated with different
states are equivalent; if so, they merge these states. The previous results
suggest that it could be interesting to go one step further and to look for
inclusions of languages instead of equivalences. We present here a learning
algorithm (DeLeTe2) based on this approach. We first introduce its target
automaton. This automaton has fewer states than the minimal DFA.



Saturated subautomata of the minimal DFA. Let $A = (\Sigma, Q, q_0, F, \delta)$
be a minimal trimmed DFA. For every state $q$ of $A$, we define $u_q$ as being
the smallest word of $\Sigma^*$ such that $\delta(q_0, u_q) = q$ (so,
$u_{q_0} = \varepsilon$). We assume that $Q = \{q_0, \ldots, q_n\}$ is ordered
using $u_q$. In other words, $q_i < q_j$ iff $u_{q_i} < u_{q_j}$. Let
$A^s = (\Sigma, Q, Q_0^s, F, \delta^s)$ be the saturated of $A$.

For any word $u$, the automaton $A_u$ is obtained from the saturated $A^s$ of
$A$ by deleting the states $q$ such that $u_q > u$. It is defined by
$A_u = (\Sigma, Q^u, Q_0^u, F^u, \delta^u)$ with
$Q^u = \{q \in Q \mid u_q \leq u\}$, $Q_0^u = Q_0^s \cap Q^u$,
$F^u = F \cap Q^u$, $\delta^u(q, x) = \delta^s(q, x) \cap Q^u$.

There is a finite number of subautomata $A_u$. When $u$ is bigger than
$u_{q_n}$, the subautomaton $A_u$ is $A^s$ itself. Conversely, what is the
smallest $u$ such that $L_A = L_{A_u}$?

Proposition 3. Let $p$ be the greatest prime state in $A$. The word $u_p$ is
the smallest word such that the automaton $A_{u_p}$ is equivalent to $A$.

Proof. All states greater than $p$ in $A$ are composed. The automaton
$A_{u_p}$ is obtained by saturation of $A$ and reduction of the states greater
than $p$. These two operations preserve the language recognized by the
automaton $A$ [DLT01].

On the other hand, if $u$ is smaller than $u_p$, the subautomaton $A_u$ does
not contain $p$. Since $p$ is a prime state, there exists a word $w$ in $L_p$
which does not belong to any residual language strictly included in $L_p$. The
word $u_p w$ is not recognized by the automaton $A_u$. □

Note that the automaton $A_{u_p}$ only depends on the language $L_A$. It is
possible to build examples where the automaton $A_{u_p}$ is exponentially
smaller than $A$.

Example 2. Let us come back to the automaton described in Fig. 1. The greatest
prime state is $q_2$ and $A_{u_{q_2}}$ is the canonical RFSA described in Fig. 2.
A characteristic sample. As in the previous section, we denote by $p$ the
greatest prime state of the minimal trimmed DFA $A$. A sample is
said to be characteristic if it provides complete information about inclusion
relations between the residual languages associated with the states smaller
than or equal to $p$.

We define $SP(L) = \{u_q \mid q \leq p\}$ and
$K(L) = \{u_q x \mid q \leq p,\ x \in \Sigma,\ \delta(q, x) \neq \emptyset\}$.

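Definition 3. A sample $S$ is characteristic for the automaton $A_{u_p}$ if

- $\forall u \in SP(L) \cup K(L)$, $u \in \mathrm{Pref}(S^+)$,

- $SP(L) \cap L \subseteq S^+$,

- $\forall u \in SP(L)$, $\forall v \in SP(L) \cup K(L)$,
  $u^{-1}L \not\subseteq v^{-1}L \Rightarrow \exists w$ such that
  $uw \in S^+$ and $vw \in S^-$.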

Let $S$ be a sample, let $u, v \in \mathrm{Pref}(S^+)$. We write:

- $u \prec v$ if no word $w$ exists such that $uw \in S^+$ and $vw \in S^-$,

- $u \simeq v$ if $u \prec v$ and $v \prec u$.

It is clear that for any sample $S$ and any words
$u, v \in \mathrm{Pref}(S^+)$, we have
$u^{-1}L = v^{-1}L \Rightarrow u \simeq v$ and
$u^{-1}L \subseteq v^{-1}L \Rightarrow u \prec v$.

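Lemma 1. Suppose that the sample $S$ is characteristic for the automaton
$A_{u_p}$, and let $u \in SP(L)$, $v \in SP(L) \cup K(L)$. We have
$u \prec v \Leftrightarrow u^{-1}L \subseteq v^{-1}L$.

Proof. Straightforward since $\forall u \in SP(L)$,
$\forall v \in SP(L) \cup K(L)$, $u^{-1}L \not\subseteq v^{-1}L \Rightarrow
\exists w$ such that $uw \in S^+$ and $vw \in S^-$. This implies that
$u \not\prec v$. □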

Example 3. (continued) We have $SP(L) = \{\varepsilon, 0, 1\}$,
$K(L) = \{0, 1, 00, 01, 10, 11\}$. The smallest characteristic set is
$S = S^+ \cup S^-$ where $S^+ = \{\varepsilon, 00, 11, 010, 10\}$
and $S^- = \{0, 1, 01, 001\}$.

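We have the following relations between elements of $SP(L)$ and elements of
$SP(L) \cup K(L)$ (if $u$ is the label of a row and $v$ the label of a column,
a word $w$ in the array means that $uw \in S^+$ and $vw \in S^-$, and $\prec$
means that $u \prec v$).

                 $\varepsilon$   0               1               00        01              10        11
$\varepsilon$    $\prec$         $\varepsilon$   $\varepsilon$   $\prec$   $\varepsilon$   $\prec$   $\prec$
0                0               $\prec$         $\prec$         $\prec$   $\prec$         $\prec$   $\prec$
1                1               1               $\prec$         1         $\prec$         $\prec$   $\prec$
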

The conclusion of the previous lemma can be checked on this array. 
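The relation $\prec$ depends only on the sample, so the array can be recomputed directly from $S^+$ and $S^-$. The following sketch does this for the sample of Example 3; the function names `witnesses` and `prec` are hypothetical, and words are encoded as Python strings with the empty string standing for $\varepsilon$.

```python
S_PLUS = {"", "00", "11", "010", "10"}
S_MINUS = {"0", "1", "01", "001"}


def witnesses(u, v):
    """Words w such that uw is in S+ and vw is in S-, in sorted order."""
    return sorted(p[len(u):] for p in S_PLUS
                  if p.startswith(u) and v + p[len(u):] in S_MINUS)


def prec(u, v):
    """u ≺ v holds when no witness separates u from v."""
    return not witnesses(u, v)


ROWS = ["", "0", "1"]                          # SP(L)
COLS = ["", "0", "1", "00", "01", "10", "11"]  # SP(L) ∪ K(L)
for u in ROWS:
    print(repr(u), ["≺" if prec(u, v) else witnesses(u, v) for v in COLS])
```

A cell may admit several witnesses (for the row 1 and the column $\varepsilon$, both 0 and 1 work), so the sketch lists all of them, whereas the array shows a single one.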

The DeLeTe2 Algorithm. We now present an algorithm that builds a NFA
from a sample of a target language $L$. If the sample is characteristic, the
algorithm builds the automaton $A_{u_p}$. Starting with an empty automaton, the
algorithm considers prefixes of the sample as characterizations of states. A new
state is added to the current set of states when it is supposed to be
non-equivalent to the previous ones. The transitions associated with the new
state are then added.
Input: a sample $S$ of a language $L$.

Let Pref $= \{u_0, \ldots, u_n\}$ be the set of prefixes of $S^+$
    ordered using the usual order.
Let $Q = Q_0 = F = \delta = \emptyset$.
Let $u = \varepsilon$.
Repeat
    If $\exists u' \in Q$ such that $u \simeq u'$
    Then
        delete $u\Sigma^*$ from Pref
    Else
        $Q = Q \cup \{u\}$
        $Q_0 = Q_0 \cup \{u\}$ if $u \prec \varepsilon$
        $F = F \cup \{u\}$ if $u \in S^+$
        $\delta = \delta \cup \{(u', x, u) \mid u' \in Q,\ u'x \in$ Pref$,\ u'x \succ u\}$
                $\cup \{(u, x, u') \mid u' \in Q,\ ux \in$ Pref$,\ ux \succ u'\}$
    End If
    Let $u$ = next word in Pref
until $A = (\Sigma, Q, Q_0, F, \delta)$ is consistent with $S$

Output: The automaton $A = (\Sigma, Q, Q_0, F, \delta)$.



Example 4. (continued) On the previous example, the algorithm needs three
steps to recover the target automaton.

1. In the first step, the state $\varepsilon$ is added. As
   $\varepsilon \prec \varepsilon$, the state $\varepsilon$ is initial. The
   word $\varepsilon$ belongs to $S^+$, so the state $\varepsilon$ is final.
   There is no relation $x \succ \varepsilon$ or $\varepsilon \succ x$
   for $x \in \{0, 1\}$ and $x \in$ Pref; thus no transition has to be added.

2. In the second step, the state 0 is added because $0 \not\simeq \varepsilon$.
   As $0 \not\prec \varepsilon$, the state 0 is not initial. The word 0 does
   not belong to $S^+$, so the state 0 is not final. We have the relations
   $\varepsilon 0 \succ 0$, $\varepsilon 1 \succ 0$, $00 \succ 0$, $01 \succ 0$
   and the relations $00 \succ \varepsilon$, $00 \succ 0$, $01 \succ 0$. The
   corresponding transitions are added.

3. In the third step, the state 1 is added because $1 \not\simeq \varepsilon$
   and $1 \not\simeq 0$. As $1 \not\prec \varepsilon$, the state 1 is not
   initial. The word 1 does not belong to $S^+$, so the state 1 is not final.
   We have the relations $\varepsilon 1 \succ 1$, $01 \succ 1$, $10 \succ 1$,
   $11 \succ 1$ and the relations $10 \succ \varepsilon$, $10 \succ 0$,
   $10 \succ 1$, $11 \succ \varepsilon$, $11 \succ 0$, $11 \succ 1$. The
   corresponding transitions are added.

This automaton is consistent with $S$.
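The three steps above can be replayed with a short program. The sketch below follows the loop of the algorithm as stated in this section; it is an illustration, not the implementation evaluated in Section 5. The function name `delete2`, the string encoding of states and the decision to stop when the prefixes are exhausted are assumptions.

```python
def delete2(s_plus, s_minus, alphabet):
    """Sketch of the DeLeTe2 loop; returns (states, initial, final, delta)."""
    s_plus, s_minus = set(s_plus), set(s_minus)

    def prec(u, v):
        # u ≺ v: no word w with uw in S+ and vw in S-
        return not any(p.startswith(u) and v + p[len(u):] in s_minus
                       for p in s_plus)

    def accepts(word, initial, final, delta):
        current = set(initial)
        for x in word:
            current = {t for q in current for t in delta.get((q, x), ())}
        return bool(current & final)

    # Prefixes of S+, shortest first, then alphabetically.
    pref = sorted({w[:i] for w in s_plus for i in range(len(w) + 1)},
                  key=lambda w: (len(w), w))
    states, initial, final, delta = [], set(), set(), {}
    i = 0
    while i < len(pref):  # assumption: stop if the prefixes are exhausted
        u = pref[i]
        if any(prec(u, q) and prec(q, u) for q in states):
            # u is equivalent to a known state: drop u and all its extensions,
            # so that index i now points at the next remaining word.
            pref = [w for w in pref if not w.startswith(u)]
        else:
            states.append(u)
            if prec(u, ""):
                initial.add(u)
            if u in s_plus:
                final.add(u)
            for q in states:
                for x in alphabet:
                    if q + x in pref and prec(u, q + x):   # q x ≻ u
                        delta.setdefault((q, x), set()).add(u)
                    if u + x in pref and prec(q, u + x):   # u x ≻ q
                        delta.setdefault((u, x), set()).add(q)
            i += 1
        if (all(accepts(w, initial, final, delta) for w in s_plus)
                and not any(accepts(w, initial, final, delta) for w in s_minus)):
            break  # the automaton is consistent with the sample
    return states, initial, final, delta


states, initial, final, delta = delete2(
    {"", "00", "11", "010", "10"}, {"0", "1", "01", "001"}, "01")
print(states, initial, final)
for key in sorted(delta):
    print(key, sorted(delta[key]))
```

On the sample of Example 3, the loop stops after adding the three states $\varepsilon$, 0 and 1, with the transitions listed in the three steps of Example 4.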



Fig. 7. The three steps.

Theorem 1. If the input of the DeLeTe2 algorithm is a characteristic sample
for the subautomaton $A_{u_p}$, it outputs the subautomaton $A_{u_p}$.

Proof. We can prove that, at each beginning of the loop, we have $u \leq u_p$.
At the beginning of the else part, $u$ belongs to $SP(L)$. Then the set $Q$ is
included in $SP(L)$. Due to the definition of the characteristic sample and to
Lemma 1, $u$ is an initial state if $u \prec \varepsilon$, i.e.
$u^{-1}L \subseteq L$; $u$ is a final state if $u \in S^+$, i.e.
$\varepsilon \in u^{-1}L$. The added transitions are all transitions such that
$u'x \succ u$, i.e. $(u'x)^{-1}L \supseteq u^{-1}L$, or $ux \succ u'$, i.e.
$(ux)^{-1}L \supseteq u'^{-1}L$. Thus, at the end of the loop, the automaton
$A$ is the subautomaton $A_u$ of the minimal DFA recognizing $L$. When the
automaton $A$ is consistent with the sample, we have the subautomaton
$A_{u_p}$. □

Given a minimal DFA, deciding whether a given state is prime is a
PSPACE-complete problem [DLT01]. Therefore, computing the smallest
characteristic set of a given DFA is not feasible. However, it is always
possible to compute within polynomial time from a given DFA some characteristic
sample: use $SP(L) = \{u_q \mid q \in Q\}$ and the corresponding $K(L)$. So,
the class of regular languages represented by DFA can be polynomially learned
from given data by our algorithm.



5 Some Experimental Results 

In this section, we compare our algorithm with other grammatical inference
algorithms: RPNI [OG92], and Red-Blue (RB) (by H. Juille and J. B. Pollack,
implemented by K. Lang), which is a variant of RPNI that uses evidence driven
state merging (EDSM, see [LPP98]).



5.1 Implementation of DeLeTe2 

Here, we do not suppose that the input sample is characteristic, and so, to 
perform our experiments, we use a variant of the algorithm presented above. 
This variant always computes an automaton which is consistent with the input 
sample, and is more efficient than the previous algorithm. Modifications used 
here do not alter theoretical results: the algorithm used here is still a learning 
algorithm in the conditions mentioned above. 

Whenever the algorithm intends to use a $\prec$ relation to modify the current
automaton, it first checks that this modification does not entail
inconsistency. In order to do this verification, it first considers all the
$\prec$ relations that are derived from the current one: as the inclusion
relation is transitive, the set of valid $\prec$ relations should be transitive
too. For example, if $q_1 \prec q_2$ is a valid relation, and if we check
whether $q_2 \prec q_3$ is valid or not, then we also consider the relation
$q_1 \prec q_3$ and modify the automaton accordingly. In the same way, if
$q_1 \prec q_2$, $L_{q_1'} = x^{-1}L_{q_1}$ and $L_{q_2'} = x^{-1}L_{q_2}$
(because $q_1'$ and $q_2'$ are successors of $q_1$ and $q_2$ on the prefix
tree), we also consider the relation $q_1' \prec q_2'$. If all these new
$\prec$ relations do not entail inconsistency with the input sample, they are
all marked as valid and the modification of the current automaton is accepted.
Also, whenever two states are considered equivalent, the algorithm merges them.

5.2 Experimental Protocol 

We build several benchmarks using the generating methods described in Section 3:
DrawNFA(20, 2, 0.1, 0.1) has been used to generate NFA and DrawRegExp(50,
0.025, 0.05, 0.05, 0.125, 0.5, 0.25) to generate regular expressions. Studies
have also been made using the procedure DrawDFA: in this case DeLeTe2 has
worse results than RPNI and Red-Blue, which is easily understood
considering that, for this generation method, there are nearly no inclusion
relations between distinct residual languages, so the approach we propose here
is less effective than algorithms using equivalence relations. Once a target
language has been drawn, examples are drawn in the following way: we choose
$l$ randomly in [0, 15], and we create a word $w$ of length $l$, each letter of
$w$ being chosen by flipping a coin.

One experiment consists in generating a language, generating a training set,
generating a test set containing 1000 words and training each algorithm on the
training set. In order to have results significantly higher than the majority
vote, only experiments whose generated language has more than 20 % of negative
examples and more than 20 % of positive examples in the learning sample
have been kept. Each benchmark corresponds to 30 experiments. Benchmarks are
denoted by the representation chosen to draw their languages and the number of
examples in the learning sample.
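The drawing of examples and the 20 % filter can be sketched as follows. This is an illustration under stated assumptions: the names `draw_word`, `draw_sample` and `balanced` are hypothetical, `member` stands for any membership test of the target language, and the sample is kept as a list, so a word may be drawn more than once.

```python
import random


def draw_word(rng, max_len=15, alphabet="01"):
    """Choose a length in [0, max_len], then each letter by a fair coin flip."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


def draw_sample(member, n_examples, rng):
    """Label n_examples random words with 1 (in the language) or 0 (not in it)."""
    words = (draw_word(rng) for _ in range(n_examples))
    return [(w, int(member(w))) for w in words]


def balanced(sample, threshold=0.2):
    """Keep a sample only if both classes exceed the threshold."""
    ratio = sum(label for _, label in sample) / len(sample)
    return threshold < ratio < 1 - threshold


rng = random.Random(1)
# Hypothetical target language used only for this illustration: words ending in 0.
sample = draw_sample(lambda w: w.endswith("0"), 200, rng)
print(len(sample), balanced(sample))
```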

On each benchmark, we compare the following algorithms: Majority Vote
(MAJ), DeLeTe2 (DLT2), RPNI, and Red-Blue (RB). We compare them using
two methods: first we observe the average recognition rate of the output
automaton of each algorithm on the test set, then we run matches (denoted
algo1 - algo2 in the table) where we count the number of experiments where one
algorithm is better than another (in terms of recognition rate), and we count a
tie when the difference is not significant (using the McNemar test, see
[Die98]). Results of those matches are written: won_by_algo1 + won_by_algo2 +
nb_tie. We also compute basic statistics on each benchmark: $n_r(L)$ is the
number of distinct residual languages of the generated language $L$, $n_p(L)$
its number of prime residual languages, $n_i(L)$ the number of inclusion
relations between distinct residual languages of $L$, and $|A_{u_p}(L)|$ is the
number of states of the target automaton of DeLeTe2. Average values for
generated languages are indicated here for each benchmark.
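One way to decide a match is sketched below. It uses the usual continuity-corrected form of McNemar's statistic, computed from the test words that exactly one of the two automata classifies correctly. The 5 % significance level (critical value 3.841 for one degree of freedom) and the names `mcnemar_statistic` and `match` are assumptions, since the text does not state the level that was used.

```python
CRITICAL_95 = 3.841  # chi-square critical value, 1 degree of freedom, 5 % level (assumed)


def mcnemar_statistic(only_first_right, only_second_right):
    """Continuity-corrected McNemar statistic on the discordant counts."""
    n = only_first_right + only_second_right
    if n == 0:
        return 0.0
    return (abs(only_first_right - only_second_right) - 1) ** 2 / n


def match(labels, pred_first, pred_second):
    """Return 'first', 'second' or 'tie' for one experiment."""
    rows = list(zip(labels, pred_first, pred_second))
    first_only = sum(a == y != b for y, a, b in rows)   # first right, second wrong
    second_only = sum(b == y != a for y, a, b in rows)  # second right, first wrong
    if mcnemar_statistic(first_only, second_only) <= CRITICAL_95:
        return "tie"
    return "first" if first_only > second_only else "second"


labels = [1] * 20
print(match(labels, [1] * 20, [1] * 8 + [0] * 12))   # clear difference
print(match(labels, [1] * 20, [1] * 18 + [0] * 2))   # difference too small
```

Counting only the discordant test words means that two automata with similar recognition rates but few disagreements end in a tie.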

5.3 Results 

Benchmark       nfa_50    nfa_100   nfa_150   nfa_200   expreg_50  expreg_100  expreg_150  expreg_200
MAJ             68.6 %    68.7 %    65.0 %    67.9 %    65.0 %     66.7 %      62.4 %      62.4 %
RB              66.5 %    68.5 %    70.7 %    70.8 %    77.5 %     82.0 %      88.1 %      90.9 %
RPNI            66.5 %    68.7 %    72.2 %    71.0 %    81.2 %     82.5 %      85.2 %      90.6 %
DLT2            69.3 %    74.4 %    76.7 %    78.9 %    81.3 %     91.4 %      92.0 %      95.7 %
DLT2 - RPNI     14+6+10   19+3+8    18+3+9    23+1+6    17+2+11    19+1+10     11+4+15     9+4+17
DLT2 - RB       16+8+6    17+4+9    19+4+7    21+2+7    9+8+13     19+3+8      15+3+12     11+1+18
$n_r(L)$        126.4     123.3     131.6     120.3     6.8        7.0         9.7         9.2
$n_p(L)$        22.1      22.6      21.4      24.7      5.8        5.8         6.7         6.1
$n_i(L)$        2172.7    2124.4    2093.3    1834.5    16.4       16.2        38.3        39.0
$|A_{u_p}(L)|$  99.5      94.8      91.7      110.0     6.6        6.7         8.6         8.3

Against RPNI, DeLeTe2 has won 130 matches, it has lost 24 matches and there
are 86 draws; against Red-Blue, it has won 127 matches, it has lost 33 matches
and there are 80 draws. So, we can say that DeLeTe2, while very basic, is better
than the two other algorithms on benchmarks generated using NFA and regular
expressions. Details on the experiments described in this paper can be found at
http://www.grappa.univ-lille3.fr/~lemay/alt01/.

6 Conclusion 

The most classical strategy used in grammatical inference of regular languages 
consists in identifying words which define identical residual languages and then 
merging the corresponding states in the current automaton. This strategy natu- 
rally leads to build a DFA in order to identify the target language. We have pro- 
posed here an alternative strategy: look for inclusion relations between residual 
languages and then saturate the current automaton. This new strategy naturally 
leads to the RFSA representation of regular languages. Both theoretical and ex- 
perimental results given in this paper show that this new approach is interesting 
and promising. 

This paper also raises the problem of representation of languages: properties 
of randomly generated regular languages highly depend on the representation 
used to generate them. Two families of languages are highlighted here : in the 
first family, most residual languages are prime and there are few inclusion rela- 
tions between them, in the second one, most residual languages are composed 
and there are many inclusion relations between them. Both those families should 
be studied in benchmarks. An interesting question not studied here is the prob- 
lem of practical cases. We can assume that some cases are mostly constituted by 
languages of the first family, whereas other cases are mostly composed of lan- 
guages of the second family. This could determine the kind of learning algorithm 
to use. 
Abstract. Büchi automata are used to recognize languages of infinite words.
Such languages have been introduced to describe the behavior of real-time
systems or infinite games. The question of inferring them from infinite
examples has already been studied, but it seems more reasonable to assume
that the data from which we want to learn is a set of finite words, namely the
prefixes of accepted or rejected infinite words. We describe the problems of
identification in the limit and polynomial identification in the limit from given
data associated with different interpretations of these prefixes: a positive prefix is
universal (respectively existential) when all the infinite words of which it is a
prefix are in the language (respectively when at least one is); the same applies
to the negative prefixes. We prove that the classes of regular ω-languages
(those recognized by Büchi automata) and of deterministic ω-languages (those
recognized by deterministic Büchi automata) are not identifiable in the limit,
whichever interpretation for the prefixes is taken. We give a polynomial
algorithm that identifies the class of safe languages from positive existential
prefixes and negative universal prefixes. We show that this class is maximal for
polynomial identification in the limit from given data, in the sense that no
superclass can even be identified in the limit.



1 Introduction 

Grammatical inference [5, 7, 11] deals with the general problem of automatically
learning machines (grammars or automata) from structured data, most often words.
Among the different syntactic objects from formal language theory, most attention
has been paid to the case of deterministic finite automata (dfa), even if some results
on different types of grammars are known. On the other hand, the question of learning
automata on infinite words has hardly been studied.

The study of these automata was motivated by decision problems in mathematical
logic. They provide a normal form for certain monadic second-order theories [4].
Later work concerned the relationship between these automata and the semantics of
modal and temporal logics [14]. Today, these automata are used to model critical
reactive systems. A software system is reactive when its purpose is to interact with
its environment, and critical when mistakes or anomalies can have serious
consequences that cost much more than the actual benefit brought by the software.
This is the case, for instance, of automatic pilots, operating systems or automatic
supervisors of nuclear power stations.
The development of such software requires automatic program proving capacities.
In particular, one wishes to verify the properties known as safety properties, which
express that something bad will never occur during the execution of the system.
Common examples of safety properties are mutual exclusion or deadlock avoidance [1].
These properties are described formally in temporal logics like PTL (Propositional
Temporal Logic), whose models, Kripke structures, can be modeled by Büchi
automata [14]. Consequently, Büchi automata make it possible to model with the
same formalism the critical systems and the logical properties that they must satisfy
and to develop effective proof algorithms (model checking).

Nevertheless, the formal specifications of critical software and, even more so, their
properties, are difficult to write for a non-specialist of automata and temporal logics.
Let us take the example of a lock chamber with two gates giving access to a safe
deposit. One enters the lock chamber by gate 1 and one leaves it by gate 2 (or vice
versa), but gate 2 should be allowed to open only if gate 1 is closed (and vice versa).
This system is represented by the automaton below:




Fig. 1. A two-gate lock chamber (o=open, c=closed) 

The safety property "gates 1 and 2 are never open at the same time" is written, in
PTL: □(not p1 or not p2), where property pi is that "gate i is open". If a non-specialist
is not able to describe a system and its properties, he may be able on the other hand to
give examples of "good" and "bad" behaviors of the system. These examples are
sequences of events, o, c^ o^ o^ o^ Cj... and o^ o, c,..., which are "good" 
behaviors, or o, o^... and o^ c, o^..., which are "bad" behaviors. The same applies to 
the logical properties the system must satisfy. Our objective is thus to learn
the Büchi automaton automatically by collecting only positive and negative examples.

The problem of learning automata on infinite words raises a first delicate question:
whatever the way of recovering the data (batch of examples, on-line learning, use of
an oracle or a teacher), is it reasonable to consider data which would be infinite
words? Let us recall that with an alphabet of size 2 the set of infinite words is already
uncountable. In previous research, the choice was to use data coming from the
countable subset of the ultimately periodic words (of type uv^ω, u and v being finite
words). Saoudi and Yokomori [12] define a (restricted) class of local languages, and
prove the learnability of these languages from positive examples; Maler and Pnueli
[9] adapt Angluin's L* algorithm [2] and make it possible to learn a particular class of
automata with the assistance of a polynomial number of equivalence and membership
queries.

Nevertheless, we wish the learning of an automaton to be done from experimental 
data received from the potential users of a system. The data will therefore necessarily 




366 



C. de la Higuera and J.-C. Janodet 



be finite words. And the interpretation of these words can vary. A finite word u can be 
a positive prefix, in the sense that one will be able to say that all its (infinite) 
continuations are good, or that one of its continuations at least is. The same kind of 
interpretations exists for the negative prefixes. 

In this article we are thus interested in the inference of various types of machines
on infinite words, from prefixes. In section 2 we give the definitions concerning
ω-languages, and in section 3 those necessary to the understanding of the
learning problems. In section 4 we establish several learnability results, by showing
that for the majority of the alternatives, identification in the limit of the classes of
ω-regular languages and ω-deterministic languages is not possible. A positive result
concerning the polynomial identification of safe languages is given.



2 Definitions 



2.1 Finite Words, Languages, and Automata 

An alphabet Σ is a finite nonempty set of symbols called letters. Σ* denotes the set of
all finite words over Σ. A language L over Σ is a subset of Σ*. In the following, letters
are indicated by a, b, c, ..., words by u, v, ..., z, and the empty word by λ. Let N be the
set of all non-negative integers.

A deterministic finite automaton (dfa) is a quintuple A=<Q, Σ, δ, F, q0> where Σ is
an alphabet, Q is a finite set of states, q0∈Q is an initial state, δ: Q×Σ→Q is a
transition function, and F⊆Q is a set of marked states, called the final states.

We define recursively:

■ δ(qi, λ) = qi

■ δ(qi, a.w) = δ(δ(qi, a), w)

L(A), the language recognized by automaton A, is {w∈Σ*: δ(q0, w)∈F}.
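
As an illustration of the recursive definition above, here is a minimal Python sketch of the extended transition function and of membership in L(A). The dictionary encoding of δ (a partial function) and the example automaton are assumptions made for this sketch, not part of the paper.

```python
from typing import Dict, Optional, Set, Tuple

State = int
Delta = Dict[Tuple[State, str], State]  # partial transition function (assumed encoding)


def run(delta: Delta, q: Optional[State], word: str) -> Optional[State]:
    """Extended transition: delta(q, lambda) = q and delta(q, a.w) = delta(delta(q, a), w).

    Returns None as soon as a transition is undefined.
    """
    for letter in word:
        if q is None:
            return None
        q = delta.get((q, letter))
    return q


def accepts(delta: Delta, finals: Set[State], q0: State, word: str) -> bool:
    """w belongs to L(A) iff delta(q0, w) is a final state."""
    return run(delta, q0, word) in finals


# Hypothetical example: words over {a, b} that end with the letter a.
example_delta: Delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
example_finals = {1}
print(accepts(example_delta, example_finals, 0, "bba"))
print(accepts(example_delta, example_finals, 0, "ab"))
```

Leaving δ partial (missing keys) matches the later remark that the automata considered are not necessarily complete.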

It is well known that the languages recognized by dfas form the family of regular 
languages. This class is considered as a borderline case for grammatical inference [7]. 



2.2 Infinite Words and ω-Languages



We mainly use the notations from [13].

An infinite word u (or ω-word) over Σ is a mapping N→Σ. Such a word is written
u(0)u(1)...u(n)..., with u(i)∈Σ. Σ^ω denotes the set of all ω-words over Σ. An
ω-language over Σ is a set of infinite words, thus a subset of Σ^ω.

Let L and K be two languages over Σ. We define:

L^ω = {u∈Σ^ω / u=u0u1...: ∀i∈N, ui∈L} and
KL^ω = {u∈Σ^ω / u=u1u2: u1∈K and u2∈L^ω}

An ω-language L is ω-regular iff there exist two finite sequences of regular
languages <Ai>, 1≤i≤n, and <Bi>, 1≤i≤n, such that L = A1B1^ω ∪ ... ∪ AnBn^ω.

Let Pref(u) denote the set of all finite prefixes of an infinite word u.

Given an ω-language L, Pref(L) is the union of the sets Pref(u) for u∈L.



2.3 Automata on Infinite Words 

Büchi automata [4] are used to recognize languages of infinite words. These
languages are currently used to model reactive systems [14] and infinite games [13].

A Büchi automaton is a quintuple A=<Q, Σ, δ, F, q0> where Σ is an alphabet, Q is a
finite set of states, q0∈Q is an initial state, δ: Q×Σ→2^Q is a transition function, and
F⊆Q is a set of marked states.

A run of A on an ω-word u is a mapping C_u: N→Q such that:

(i) C_u(0)=q0

(ii) ∀i∈N, C_u(i+1)∈δ(C_u(i), u(i))

Note that C_u is undefined if at some point C_u(i) is undefined.

An ω-word u is accepted by A iff there exists a state of F which appears infinitely
often in a run of A on u. Let L(A) be the set of all ω-words accepted by A. We can
show [13] that an ω-language L is ω-regular iff L=L(A) for some Büchi automaton A.

An automaton is deterministic iff |δ(q, a)| ≤ 1 for all states q and letters a.

Let Reg_ω(Σ) be the class of all ω-regular languages and Det_ω(Σ) the class of all
ω-languages which are recognized by a deterministic Büchi automaton. Unlike what
happens in the case of finite automata, Det_ω(Σ) ⊆ Reg_ω(Σ) but Det_ω(Σ) ≠ Reg_ω(Σ).
Indeed, consider the language (b*a)^ω of words with an infinite number of a. This
language is accepted by the deterministic automaton 2a below but its complement
(a+b)*b^ω is not deterministic, although it is recognized by the non-deterministic
automaton 2b.



Fig. 2. Büchi automata 2a and 2b recognize (b*a)^ω and (a+b)*b^ω respectively. Their
marked states are in gray.
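
To make the acceptance condition concrete, the following Python sketch decides whether a deterministic Büchi automaton accepts an ultimately periodic word uv^ω. It is only an illustration under stated assumptions: the function name, the dictionary encoding of δ and the two-state automaton used as an example (a plausible shape for a deterministic automaton recognizing words with infinitely many a) are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple

Delta = Dict[Tuple[int, str], int]  # deterministic, possibly partial (assumed encoding)


def accepts_lasso(delta: Delta, marked: Set[int], q0: int, u: str, v: str) -> bool:
    """Decide whether the deterministic Buchi automaton accepts u.v^omega (v non-empty)."""
    if not v:
        raise ValueError("v must be non-empty")
    q: Optional[int] = q0
    for letter in u:
        q = delta.get((q, letter))
        if q is None:          # the run is undefined: the word is rejected
            return False
    seen_at: Dict[int, int] = {}  # state at the start of a v-block -> index of that block
    hits: List[bool] = []         # hits[k]: a marked state was entered while reading block k
    while q not in seen_at:
        seen_at[q] = len(hits)
        hit = False
        for letter in v:
            q = delta.get((q, letter))
            if q is None:
                return False
            hit = hit or q in marked
        hits.append(hit)
    # Blocks from index seen_at[q] onwards repeat forever.
    return any(hits[seen_at[q]:])


# Hypothetical automaton for "infinitely many a": state 1 is entered on each a.
inf_a: Delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
print(accepts_lasso(inf_a, {1}, 0, "bb", "ab"))
print(accepts_lasso(inf_a, {1}, 0, "a", "b"))
```

The loop terminates because the state reached at the start of a block of v must repeat after at most |Q| blocks; a marked state is visited infinitely often exactly when it is entered inside the repeating part.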
2.4 ω-Safe Languages and DB-Machines

An ω-language L is safe [1] iff

∀w∈Σ^ω, (∀u∈Pref(w), ∃v∈Σ^ω: uv∈L) ⇒ w∈L
i.e.

∀w∈Σ^ω, Pref(w)⊆Pref(L) ⇒ w∈L
that is to say,

∀w∈Σ^ω, w∉L ⇒ (∃u∈Pref(w): ∀v∈Σ^ω, uv∉L)

Let Safe_ω(Σ) denote the class of all safe ω-regular languages.
b*a^ω is not a safe language. Indeed, every prefix b^n of b^ω (which is not in the
language) is a prefix of b^n a^ω (which is in the language). On the other hand, b*a^ω + b^ω
is safe. It follows that Safe_ω(Σ) ≠ Det_ω(Σ) and we are going to show (Theorem 1) that
Safe_ω(Σ) ⊆ Det_ω(Σ).

A DB-machine is a deterministic Büchi automaton where F=Q.

Theorem 1. L is a safe ω-regular language iff L is recognized by a DB-machine.

We introduce the following definitions in order to prove the previous theorem:

Definition 1. P⊆Σ* is a regular prefix language if and only if:

1. P is regular;

2. every prefix of a word of P is a word of P: ∀u∈Σ* ∀a∈Σ: ua∈P ⇒ u∈P;

3. every word of P is a proper prefix of another word of P: ∀u∈P ∃a∈Σ: ua∈P.

Definition 2. A dfa A is a prefix automaton (or prefix dfa) if and only if

1. every state is final;

2. every state is alive: ∀q∈Q, ∃a∈Σ: δ(q, a)∈Q.
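
The two conditions of Definition 2 are easy to test mechanically. The Python sketch below is a hedged illustration: the function name and the encoding of a (partial) dfa as a set of states, an alphabet string and a dictionary are assumptions introduced here.

```python
from typing import Dict, Iterable, Set, Tuple


def is_prefix_automaton(states: Set[int], alphabet: Iterable[str],
                        delta: Dict[Tuple[int, str], int], finals: Set[int]) -> bool:
    """Check Definition 2: every state is final and every state is alive."""
    letters = list(alphabet)
    every_state_final = set(finals) == set(states)
    every_state_alive = all(
        any(delta.get((q, a)) in states for a in letters) for q in states
    )
    return every_state_final and every_state_alive


# Hypothetical examples over the alphabet {a, b}.
good = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1}   # both states have a successor
bad = {(0, "a"): 1}                              # state 1 has no successor: not alive
print(is_prefix_automaton({0, 1}, "ab", good, {0, 1}))
print(is_prefix_automaton({0, 1}, "ab", bad, {0, 1}))
```

The check assumes that all listed states are reachable from the initial state, which the sketch does not verify.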

Proposition 1.

1. If L is an ω-regular language, then Pref(L) is a regular prefix language;

2. if P is a regular prefix language, then there exists a prefix automaton which
recognizes P;

3. if A=<Q, Σ, δ, Q, q0> is a prefix automaton, then the language L(M) recognized by
the DB-machine M=<Q, Σ, δ, Q, q0> is ω-regular and satisfies L(A)=Pref(L(M)).

Proof. Notice that several different ω-languages can have the same prefix language.

1) Let P=Pref(L). L is ω-regular, so there are sequences of regular languages <Ai>
and <Bi>, 1≤i≤n, such that L = A1B1^ω ∪ ... ∪ AnBn^ω. Then Pref(L) is the union, for
1≤i≤n, of the languages Pref(Ai) ∪ AiBi*Pref(Bi), which is a
regular language closed under prefixes. Let u∈P. As P=Pref(L), there exists
v∈Σ^ω such that uv∈L. Let a be the first letter of v. Then ua is a prefix of uv, so ua∈P.

2) Let P be a regular prefix language. P is recognized by a dfa A which is minimal but
not necessarily complete (i.e., we remove its dead state if necessary). As P is prefix,
every state of this automaton is final. Finally, let q be a state of A and u a word such
that δ(q0, u)=q. By the definition of a prefix language, there exists a∈Σ such that
ua∈P. So δ(q, a)∈Q, thus q is alive.

3) Let A=<Q, Σ, δ, Q, q0> be a prefix automaton. Consider the corresponding
DB-machine M=<Q, Σ, δ, Q, q0>. Let us prove that Pref(L(M))=L(A). Let u∈Pref(L(M)).
Then there exists w∈Σ^ω such that uw∈L(M). It is clear that δ(q0, u)∈Q, so u∈L(A).
Conversely, let u∈L(A) and q=δ(q0, u). As q is alive, we can build two words v and w
such that δ(q, v)=q' and δ(q', w)=q'. Clearly, the run goes infinitely often through
state q'. So uvw^ω∈L(M) and u∈Pref(L(M)).

Proof of Theorem 1. Let L be a language recognized by a DB-machine
M=<Q, Σ, δ, Q, q0> and w∈Σ^ω. Assume that every prefix of w can be continued
into a word of L recognized by M. The mapping C_w: N→Q such that C_w(0)=q0 and
∀i∈N, C_w(i+1)=δ(q0, w(0)...w(i))=δ(C_w(i), w(i)) is a run of M on w. Since all the
states of M are marked, this run is successful, so w∈L. Hence, L is a safe ω-regular
language.
Conversely, let L be a safe ω-regular language. By Proposition 1, Pref(L) is a regular
prefix language which is recognized by some prefix automaton A=<Q, Σ, δ, Q, q0>.
We claim that L is recognized by the DB-machine M=<Q, Σ, δ, Q, q0>. Indeed, by
Proposition 1, the language L(M) satisfies Pref(L(M))=L(A). Moreover, by the first
part of this proof, L(M) is a safe language (since M is a DB-machine). So L and L(M)
are both safe languages such that Pref(L)=Pref(L(M))=L(A). Assume that there exists
a word w in L and not in L(M) (or vice versa). As Pref(L)=Pref(L(M)), every prefix of
w is in Pref(L(M)). Since L(M) is a safe language, w itself is in L(M), which is
impossible. So L=L(M).

Corollary 1. Let L and L' be two safe ω-regular languages. Pref(L)=Pref(L') ⇔ L=L'.
Proof. ⇐ is straightforward. ⇒ is an immediate consequence of the previous proof.



3 Learning ω-Regular Languages from Their Prefixes

One of the main difficulties consists in explaining the meaning of "p is a positive
prefix of the ω-language L" and "n is a negative prefix of the ω-language L". The
meaning of prefixes and the interesting cases to be studied depend on the context of
our problem.

Definition 3.

1. p is an ∃-positive prefix of L iff ∃u∈Σ^ω, pu∈L

2. p is a ∀-positive prefix of L iff ∀u∈Σ^ω, pu∈L

3. n is an ∃-negative prefix of L iff ∃u∈Σ^ω, nu∉L

4. n is a ∀-negative prefix of L iff ∀u∈Σ^ω, nu∉L

Given an ω-language L, let P_∀(L) denote the set of all ∀-positive prefixes of L,
P_∃(L) the set of all ∃-positive prefixes of L, N_∀(L) the set of all ∀-negative prefixes of
L, and N_∃(L) the set of all ∃-negative prefixes of L.

Two finite sets S+ and S- of finite words form together a set of (p, n)-examples for
an ω-language L if and only if S+⊆P_p(L) and S-⊆N_n(L).

For instance, on the automaton 2a, L=((a+b)*a)^ω, P_∀(L)=N_∀(L)=∅ and
P_∃(L)=N_∃(L)=Σ*.

We can also remark that for all ω-languages L, P_∃(L)=Pref(L) and

P_∃(L)∩N_∀(L)=P_∀(L)∩N_∃(L)=∅
P_∃(L)∪N_∀(L)=P_∀(L)∪N_∃(L)=Σ*
P_∀(L)=N_∀(Σ^ω\L)
P_∃(L)=N_∃(Σ^ω\L)

3.1 On Convergence Criteria 

In this section, we adapt the definitions of Gold [5] and de la Higuera [7]. Other
paradigms than identification in the limit are known, but they are often either similar
to these or harder to establish. A comparison between different models can be found
in [11].

It will be useful to systematically consider a class L of languages and an associated
class R of representations. The latter will have to be strong enough to represent
the whole class of languages, i.e. ∀L∈L, ∃r∈R: L(r)=L.

The size of a representation (denoted |r|) is polynomially related to the size of its
encoding. In the case of a deterministic automaton, the number of states is a relevant
measure, since the alphabet has a constant size.

All the classes we consider are recursively enumerable. Moreover, for Büchi
automata and finite words, given x∈{∃, ∀}, the problems "w∈P_x(L(A))?" and
"w∈N_x(L(A))?" are decidable, so the definition of identification in the limit from
prefixes can be presented as follows:

Definition 4. A class L of ω-languages is (p, n)-identifiable in the limit for a class R
of representations if and only if there exists an algorithm A such that:

1. given a finite set <S+, S-> of prefixes, with S+⊆P_p(L) and S-⊆N_n(L), A returns h in
R consistent with <S+, S->;

2. for all representations r of a language L in L, there exists a finite characteristic set
<CS+, CS->, such that, on <S+, S-> with CS+⊆S+⊆P_p(L) and CS-⊆S-⊆N_n(L), A
returns a hypothesis h equivalent to r.

We now adapt the definition of polynomial identification in the limit from fixed
data [5, 7] to the case of learning from prefixes. This definition takes better care of
practical considerations: for instance with this definition, deterministic finite automata
are learnable whereas context-free grammars or non-deterministic automata are not.

Definition 5. A class L of ω-languages is (p, n)-polynomially identifiable in the limit
from fixed finite prefixes for a class R of representations if and only if there exists an
algorithm A and two polynomials α( ) and β( ) such that:

1. given a set <S+, S-> of prefixes of size m (*), with S+⊆P_p(L) and S-⊆N_n(L), A returns
h in R in O(α(m)) time and h is consistent with <S+, S->;

2. for all representations r of size n of a language L in L, there exists a characteristic
set <CS+, CS-> of size at most β(n), such that, on <S+, S-> with CS+⊆S+⊆P_p(L)
and CS-⊆S-⊆N_n(L), A returns a hypothesis h equivalent to r.



(*) The size of a set S of finite words is the sum of the lengths of all the words in S.
3.2 The Problem of Learning ω-Languages from Their Prefixes

We have now defined the different parameters of the problem. The main question is:
can the class L of ω-regular languages represented by R be learned following the
criterion C from a set of (p, n)-examples?

The classes L we are interested in are those defined in section 2. The representation
classes are B-Aut (Büchi automata) for Reg_ω(Σ), DB-Aut (deterministic Büchi
automata) for Det_ω(Σ) and DB-Mach (DB-machines) for Safe_ω(Σ). The criteria will be
identification in the limit and polynomial identification in the limit from fixed
prefixes. The examples of positive and negative prefixes will be defined according to
the different combinations of the quantifiers ∃ and ∀.

Hence a learning problem will be completely specified when given:

1. the class of languages and its representation class;

2. the convergence criterion;

3. the interpretation one gives to positive and negative prefixes.

A problem will thus be a triple <L_R, criterion, interpretation> where criterion will
be idlim (identification in the limit) or polyid (polynomial identification in the limit from
fixed prefixes) and interpretation will be a pair (p, n) such that p and n ∈ {∃, ∀}.

Example. The problem <Safe_ω(Σ)_DB-Mach, idlim, (∃, ∀)> is the one of identification in
the limit of the class Safe_ω(Σ) where the languages are represented by DB-machines
and a presentation made of existential positive prefixes and universal negative
prefixes (see definition 3) is given. Such a problem will have a "positive status" if this
class is actually learnable with the chosen criterion, a "negative status" if it is not and
an "unknown status" if the problem is unsolved.



4 Results 



We give two types of results. The first concerns classes Reg_ω(Σ) and Det_ω(Σ), for
which identification in the limit from prefixes is impossible. The second concerns the
class of safe languages, for which polynomial identification in the limit from fixed
prefixes is proved.



4.1 General Properties 

We first give a straightforward reduction property; we establish that polynomial
identification only holds when identification in the limit also holds: if
<L_R, idlim, sign> has a negative status, so does <L_R, polyid, sign>.

A necessary condition for the identification of a class of languages is that any pair
of languages from the class can be effectively separated by some prefix:

Lemma 1. Let L be a class of ω-languages and R a class of representations for L.
If there exist two distinct languages L1 and L2 in L such that P_p(L1)=P_p(L2) and
N_n(L1)=N_n(L2), then the problem <L_R, idlim, (p, n)> has a negative status.

Proof. Suppose that an algorithm A identifies class L; then L1 and L2 have respective
characteristic sets CS1 and CS2. But L1 and L2 are both consistent with CS1∪CS2. Hence
either L1 or L2 is not identified.

Theorem 2. For any class of representations R, ∀p, n∈{∃, ∀}, <Reg_ω(Σ)_R, idlim,
(p, n)> and <Det_ω(Σ)_R, idlim, (p, n)> have negative status.

Proof. We will use the same counter-example, shown in Figure 3, to prove that
neither the class of all ω-regular languages, nor that of all ω-deterministic ones are
identifiable in the limit (and furthermore polynomially identifiable from given
prefixes). The languages accepted by automata 3a and 3b are respectively
L1=a^ω+a*ba*b(a+b)^ω and L2=a*ba*b(a+b)^ω. Whatever the choice of quantifiers p and
n, the sets P_p and N_n are identical for both languages.



Fig. 3. Automata 3a and 3b accept respectively languages a^ω+a*ba*b(a+b)^ω and a*ba*b(a+b)^ω.



4.2 On the Identification of Safe Languages 

The previous result is very negative, but hardly surprising. It implies that learning
requires either to consider a subclass of languages, and/or to change the convergence
criterion. It is surely not reasonable to choose a less demanding criterion than
identification in the limit; we will thus concentrate on a subclass of ω-deterministic
languages in the sequel: the safe ω-languages. We first prove that the associated class
of prefix languages is polynomially identifiable in the limit from given data:

Proposition 2. The class of regular prefix languages, represented by prefix dfas, is
polynomially identifiable in the limit from given prefixes.

Proof. To prove the above proposition we use algorithm RPNI-prefixes1. An
alternative and more efficient algorithm, which can return a compatible non-trivial
prefix automaton even when the characteristic set is not included in the data, is
proposed in the appendix. As for RPNI-prefixes1, it makes use, as a sub-routine, of
RPNI [10] which can identify a dfa from positive and negative data (typically two
finite sets of finite words S+ and S-).

The first object RPNI builds is the prefix tree acceptor (pta): this is the largest dfa
with no useless states recognizing exactly S+.
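
A minimal Python sketch of a prefix tree acceptor may help fix ideas. It is not the RPNI implementation of [10]: the function name, the numbering of states and the dictionary encoding are assumptions made for this illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def build_pta(positive: Iterable[str]):
    """Prefix tree acceptor: one state per prefix of a word of `positive`."""
    state_of: Dict[str, int] = {"": 0}      # prefix -> state; the empty word is the root
    delta: Dict[Tuple[int, str], int] = {}
    finals: Set[int] = set()
    for word in sorted(positive, key=lambda w: (len(w), w)):
        for i in range(len(word)):
            prefix, longer = word[:i], word[:i + 1]
            if longer not in state_of:
                state_of[longer] = len(state_of)
            delta[(state_of[prefix], word[i])] = state_of[longer]
        finals.add(state_of[word])          # only the words of S+ are accepted
    return state_of, delta, finals


# Hypothetical sample S+ = {a, ab, ba}.
pta_states, pta_delta, pta_finals = build_pta({"a", "ab", "ba"})
print(len(pta_states), sorted(pta_finals))
```

In this encoding the leaves of the pta are exactly the states without outgoing transitions, which is the notion of leaf used by algorithm RPNI-prefixes1.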



Note. Formally, for the languages L1 and L2 of Figure 3:
P_∃(L1)=P_∃(L2)=Σ*, N_∃(L1)=N_∃(L2)=a*+a*ba*,
P_∀(L1)=P_∀(L2)=a*ba*b(a+b)*, N_∀(L1)=N_∀(L2)=∅.

Note. A state is useless if it does not lead to an accepting state, or is not accessible
from the initial state.
Algorithm RPNI-prefixes1

Input: S=<S+, S-> (a set of positive words S+, and
       of negative words S-)

Output: a prefix automaton (<Q, Σ, δ, F, q0>)

Begin

  A ← RPNI(S+, S-);

  If A is a prefix dfa
  then return A

  else max_neg ← max{length(u): u∈S-};

    For all w in S+ s.t. w∈Pref(S-) and
        w∉Pref(S+\{w}) do
      Compute v of length max_neg s.t.
        Pref(v)∩S-=∅ and w∈Pref(v);
      S+ ← S+∪{v};

    A ← PTA(S+);

    Q ← Q∪{qf}; F ← Q;

    For all a in Σ do δ(qf, a) ← qf;
    For all q in Q such that q is a leaf do
      For all a in Σ do δ(q, a) ← qf;
    Return A

end.

If <S+, S-> contains a characteristic set of the target language L, RPNI returns a
prefix automaton A that accepts language L [10]. If <S+, S-> does not contain a
characteristic set, RPNI returns an automaton which is consistent with <S+, S->, but
may be neither prefix nor even transformable into a prefix automaton. In that case
RPNI-prefixes1 transforms the pta into a consistent prefix automaton.

Indeed function PTA(S+) constructs the pta corresponding to S+, to which extra
words have been added whose positive labeling does not introduce inconsistency; testing
(w∈Pref(S-) and w∉Pref(S+\{w})) identifies the states of the pta that
have no successors; these states must then lead to a new universal state qf whenever
the new transition is not used by some negative word: such a transition always exists
since the data is supposed to be consistent. Building a polynomial implementation is
straightforward.

Theorem 3. <Safe_ω(Σ)_DB-Mach, polyid, (∃, ∀)> has a positive status.

Proof. We show that the conditions of definition 5 are met:

i. Let L be a safe language. On any pair of sets <S+, S-> of (∃, ∀)-prefixes for L, by
proposition 2 a prefix dfa accepting S+ and rejecting S- can be returned in polynomial
time. In constant time this automaton is transformed into a DB-machine M by
changing the acceptance criterion. Furthermore S+⊆Pref(L(M)) and
S-∩Pref(L(M))=∅.



^ A state is universal if by any letter there is a transition to the same state. 






ii. Let L be a safe language, and M a DB-machine accepting L. Let A be the prefix
automaton associated with M. Let <CS+, CS-> be a characteristic set for A and RPNI.
Let now <S+, S-> be such that CS+⊆S+, CS-⊆S-, S+⊆L(A) and S-∩L(A)=∅. Notice
that the size of <CS+, CS-> is polynomial in that of A which in turn is the same as the
size of M. On input <S+, S-> RPNI returns an automaton A' equivalent to A. By
construction, the DB-machine M' associated to A' is such that
Pref(L(M'))=L(A')=L(A)=Pref(L(M)). By corollary 1, L(M)=L(M') holds.

Theorem 4. If L strictly contains Safe_ω(Σ) and R is a class of machines for L,
<L_R, idlim, (∃, ∀)> has a negative status.

Proof. Let L be a class strictly containing Safe_ω(Σ) and L a language in L but not in
Safe_ω(Σ). P_∃(L) is a prefix language. But in that case there exists a language L' in
Safe_ω(Σ) such that P_∃(L)=P_∃(L') and N_∀(L)=N_∀(L'). By lemma 1, it follows that L is
not identifiable.

Theorems 3 and 4 allow us to deduce a final result concerning learning from (∀, ∃)-
prefixes. An ω-language L is co-safe iff its complement Σ^ω\L is a safe language.
We denote by Co-Safe_ω(Σ) the family of co-safe ω-regular languages. Co-safe languages
are accepted by co-DB-machines, i.e. complete Büchi automata with a unique marked
state which is a universal state.

Theorem 5. <Co-Safe_ω(Σ)_co-DB-Mach, polyid, (∀, ∃)> has a positive status. Furthermore
for any class L strictly containing Co-Safe_ω(Σ), and R a class of machines for L,
<L_R, idlim, (∀, ∃)> has negative status.

Proof. Any complete prefix presentation by (∀, ∃) of a co-safe language L is a
complete prefix presentation by (∃, ∀) of the safe language Σ^ω\L, since
P_∃(Σ^ω\L)=N_∃(L) and N_∀(Σ^ω\L)=P_∀(L). Moreover the construction of a co-DB-machine
from a DB-machine can be done in linear time by completing it with a universal state
which becomes the marked state. From theorem 3, the problem <Safe_ω(Σ)_DB-Mach,
polyid, (∃, ∀)> has a positive status, and so has <Co-Safe_ω(Σ)_co-DB-Mach,
polyid, (∀, ∃)>.
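
The completion step used in this proof can be sketched in a few lines of Python. This is a hedged illustration: the function name, the integer state names and the dictionary encoding of δ are assumptions, and the example DB-machine (accepting only a^ω) is hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def to_co_db_machine(states: Set[int], alphabet: Iterable[str],
                     delta: Dict[Tuple[int, str], int]):
    """Complete a DB-machine with a universal state, which becomes the only marked state."""
    letters = list(alphabet)
    sink = max(states) + 1                 # fresh state name (assumes integer states)
    completed = dict(delta)
    for q in list(states) + [sink]:
        for a in letters:
            # every missing transition, including those of the sink itself, goes to the sink
            completed.setdefault((q, a), sink)
    return set(states) | {sink}, completed, {sink}


# Hypothetical DB-machine over {a, b} accepting only the word a^omega.
co_states, co_delta, co_marked = to_co_db_machine({0}, "ab", {(0, "a"): 0})
print(sorted(co_states), co_marked)
```

A run of the completed automaton visits the marked state infinitely often exactly when it leaves the original DB-machine at some point, so the result accepts the complement of the safe language; the construction visits each pair (state, letter) once, in line with the linear-time claim.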



5 Conclusion 

This work is a first approach to the problem of learning or identifying automata on
infinite words from finite prefixes. A certain number of open questions and new
research directions can be proposed. Among those we mention:

The problem <L_R, criterion, (∃, ∃)>. It is rather easy to show that for all the classes
of languages studied in this paper, the status will be negative. It seems relevant to find
a class of languages (undoubtedly rather restricted) for which the status would be
positive.

Learning from prefix queries (membership queries on the prefixes) and
equivalence queries.

Improvement of the inference algorithm (RPNI-prefixes) for the learning of the
prefix languages. The algorithm proposed is polynomial. It is however neither easy to
implement, nor (probably) does it perform well in practice.

Lastly, the validation of this algorithm on real data (produced by a system)
remains to be done. The automata corresponding to real-world tasks typically have
a large alphabet but few outgoing transitions per state. In this context,
simplification by typing of the alphabet [3] is undoubtedly a direction worth
pursuing.
Appendix: A Constructive Prefix dfa Inference Algorithm

The algorithm proposed in section 4 identifies polynomially and in the limit from
given data any prefix automaton. It is nevertheless of little practical use: one
is never sure to have a characteristic set inside the learning data, and returning the pta
with some added edges is not convincing. We give here a specific prefix automaton
learning algorithm. It is based on RPNI [10], and uses notations from [6].

Algorithm RPNI-prefixes2 adds to S+ all prefixes of S+, and goes through a typical
state merging routine. The only problem is to make sure that every merge leads to an
automaton that can be completed into a prefix automaton. To do this each positive
state has to stay alive: there must be at least one infinite word leading from this state
that avoids every negative state.

Algorithm RPNI-prefixes2

Input: S=<S+, S->

Output: a prefix automaton (defined by δ, F+, F-)

Begin

  (* Initializations *)

  S+ ← S+∪Pref(S+); n ← 0;

  ∀a∈Σ, Tested(q0, a) ← ∅; F+ ← {q0}; F- ← ∅;

  While there are some unmarked words in S+∪S- do
    <q, a, q'> ← chose_transition();

    If Possible(δ(q, a)=q')
    then δ(q, a) ← q';

      For all unmarked w in S+ do
        If δ(q0, w)=q" then mark(w); F+ ← F+∪{q"};

      For all unmarked w in S- do
        If δ(q0, w)=q" then mark(w); F- ← F-∪{q"};

    else Tested(q, a) ← Tested(q, a)∪{q'};

      If |Tested(q, a)|=n+1 (* impossible to merge *)
      then (* creation of a new state *)

        n ← n+1; Q ← Q∪{qn}; δ(q, a) ← qn;

        For all unmarked w in S+ do
          If δ(q0, w)=q" then mark(w); F+ ← F+∪{q"};

        For all unmarked w in S- do
          If δ(q0, w)=q" then mark(w); F- ← F-∪{q"};

        ∀a∈Σ, Tested(qn, a) ← ∅;

  End while;

  (* conversion into a consistent prefix dfa *)
  Q ← F+;

  For all q∈F+ such that ∀a∈Σ δ(q, a)∉Q
    choose w minimal such that
      ({u: δ(q0, u)=q}.Pref(w))∩S-=∅;

    Q ← Q∪{qi: 0<i<|w|};

    F+ ← F+∪{qi: 0<i<|w|};

End.

Function Chose_transition: returns a triplet <q, a, q'> corresponding to the
transition δ(q, a)=q' where δ(q, a) is undefined and q'∉Tested(q, a). Different
functions can work. Typically EDSM type functions have been shown preferable [8].

Function Possible(δ(q, a)=q'): returns True if adding to δ the rule (q, a, q')
does not lead to an inconsistency, False otherwise.

Inconsistency is tested on the current automaton to which the rule δ(q, a)=q' is added.
It can have two causes:

■ there exist two words uaw and vw such that δ(q0, u)=q and δ(q0, v)=q' and
either uaw∈S+ and vw∈S-, or uaw∈S- and vw∈S+.

■ a state is no longer alive; a state q is alive if it can still lead to an accepting state:
∃w∈Σ^ω / ({u: δ(q0, u)=q}.Pref(w))∩S-=∅. This ensures that the current automaton
(and thus by induction the last one) can be transformed into a prefix dfa.

The main elements of the proof of RPNI-prefixes2 are:

■ The algorithm returns a prefix automaton (by construction).

■ The Possible test ensures that all states are alive and that at any moment the
automaton can be transformed into a consistent prefix automaton.

■ In the case where a characteristic set (for RPNI) is included, no transformation will
take place.

■ Finally, the algorithm works in polynomial time.

We refer the reader to [6] for a complete proof (in the case of dfas, but the proof
can easily be adapted to the case of prefix automata).